Preface


Convolutional neural networks (CNNs), a type of artificial neural network
that has long been popular in computer vision, are gaining popularity in a
variety of fields, including radiology. A CNN uses several building blocks,
such as convolution layers, pooling layers, and fully connected layers, to
learn spatial hierarchies of features automatically and adaptively through
backpropagation. A neural network is a hardware and/or software system
modelled after the way neurons in the human brain work. Traditional neural
networks are not designed for image processing and must be fed images in
smaller chunks. A CNN's "neurons" are arranged more like those in the visual
cortex, the area of the brain in humans and other animals responsible for
processing visual input. Arranging the layers of neurons so that they span
the whole visual field avoids the piecemeal image processing problem of
traditional neural networks. A CNN employs a design similar to a multilayer
perceptron that is optimised for low processing requirements. Its layers
comprise an input layer, an output layer, and hidden layers that include
several convolutional layers, pooling layers, fully connected layers, and
normalisation layers. The removal of these constraints and the gains in
image processing efficiency result in a system that is significantly more
effective and easier to train for image processing and natural language
processing. This book explains the core concepts of CNNs and how they are
applied to diverse tasks, as well as their problems and future directions.
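As a rough illustration of how the building blocks named above fit together,
the following minimal sketch chains one convolution, a ReLU, one max-pooling
step, and one fully connected layer in plain NumPy. The function names, the
random weights, and the 8x8 input are assumptions made for this example only;
the sketch shows a forward pass and omits training by backpropagation.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value is a weighted sum over one local patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise non-linearity applied after the convolution."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size]
    return x.reshape(h, size, w, size).max(axis=(1, 3))

def tiny_cnn_forward(image, kernel, fc_weights, fc_bias):
    """Convolution -> ReLU -> pooling -> fully connected layer."""
    features = max_pool2d(relu(conv2d(image, kernel)))
    return features.reshape(-1) @ fc_weights + fc_bias

# Hypothetical inputs: an 8x8 "image", one 3x3 kernel, two output classes.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.standard_normal((3, 3))
fc_weights = rng.standard_normal((9, 2))  # 8x8 -> conv 6x6 -> pool 3x3 = 9
fc_bias = np.zeros(2)

scores = tiny_cnn_forward(image, kernel, fc_weights, fc_bias)
print(scores.shape)  # (2,)
```

Because the same small kernel slides over the whole image, the convolution
layer needs only nine weights here, whereas a fully connected layer applied
directly to the 64 input pixels would need 64 weights per output unit.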

Through this edited volume we intend to provide a structured presentation
of CNN-enabled IoT applications in vision, speech, and natural language
processing. This book discusses a variety of CNN techniques and
applications, including but not limited to IoT-enabled CNNs for speech
denoising, a smart app for visually impaired people, disease detection, ECG
signal analysis, weather monitoring, and texture analysis.

Unlike other books on the market, this book covers the tools, techniques,
and challenges associated with the implementation of CNN algorithms, their
computation time, and the complexity of reasoning about and modelling
various types of data. We have included current research trends and future
directions for CNNs. This edited book contains numerous new scientific
results and also compiles existing knowledge and research in the field. We
have made a great effort to cover everything with citations while
maintaining a fluent exposition; all corrections and suggestions will be
greatly appreciated. Throughout the chapters, the contributors have included
proper references and added all the necessary diagrams and data to make it
easier for readers to understand the concepts.

The following are the abstracts of the chapters included in this edited
book:


Chapter-1: 


Convolutional Neural Networks (CNNs) have proved to be revolutionary in
computer vision applications, frequently surpassing standard models and
even humans in image recognition tasks. Large models place a premium on
cutting-edge performance and frequently find it difficult to scale down.
A bibliometric approach is used to identify the state of the art of CNNs in
IoT. The research looked at papers in four major databases using the
keywords “Convolutional Neural Network” and “IoT” or “Internet of Things”:
IEEE Xplore, SpringerLink, ACM Digital Library, and ScienceDirect.
Publications from the Scopus database are used in the bibliometric study.
The network citation analysis and publishing trends, in particular, provide
insight into the current domains and development patterns for CNNs in IoT.
The bibliometric study covers the most important and prolific authors,
organisations, countries, sources, and documents. China has published the
most documents, followed by the United States. The College of
Telecommunications and Information Engineering at Nanjing University of
Posts and Telecommunications has published the most documents in the
domains of IoT and CNN. Among authors, Wang Y. has published the most
documents in this domain. Among sources, IEEE Access has the most papers in
the fields of IoT and CNN.


Chapter-2: 


The Internet of Things (IoT) has proved beneficial for interconnecting
computing devices embedded in everyday items, allowing objects to transmit
and receive data via the internet for day-to-day activities. It uses unique
identifiers to connect computing equipment with mechanical, electronic, and
other items, as well as animals or humans, allowing data to be transferred
over a network without the need for human-to-human or human-to-computer
interaction. Artificial intelligence (AI), deep learning, and machine
learning are all being used to make data collection easier and more dynamic
for IoT technologies. However, there is no categorical assessment of IoT
research, particularly in a well-known research database, that identifies
the main fields where IoT is widely used and where it faces greater
problems. This chapter therefore discusses how CNN-enabled IoT techniques
can be used in a variety of industries. The advantages and disadvantages of
a CNN-enabled IoT-based system are highlighted. The chapter summarises the
studies by field of application and identifies gaps in the research. It
recommends more partnerships between researchers and users/experts in the
areas where IoT applications are employed, in order to overcome
restrictions, improve IoT technology, and provide insight into the security
of future work in cyberspace.


Chapter-3: 


Speech-controlled smart devices have recently become popular in
Internet-of-Things (IoT) applications. Reverberation and noise are well
known to reduce the efficiency of human-machine interaction in indoor
applications. As a result, speech enhancement has emerged as a significant
front-end strategy for improving performance and has attracted increasing
attention in recent years. This chapter focuses on single-channel speech
enhancement algorithms based on deep learning (DL) for both denoising and
dereverberation, with single and multiple speakers considered. Owing to
their parameter efficiency and state-of-the-art performance, convolutional
neural network (CNN) based models are presented for this difficult speech
enhancement task. Following a description of one-stage and multi-stage
CNN-based models, a series of experiments is carried out to demonstrate the
benefits and drawbacks of using them to extract one desired speaker and
multiple desired speakers. The experiments show that CNN-based models can
achieve high performance when only one desired speaker is to be extracted,
but that their performance degrades dramatically when multiple desired
speakers are to be extracted. Finally, some potential ways of improving the
performance of extracting multiple desired speakers are discussed, and
future research topics are outlined.


Chapter-4: 


In the present era, one of the signs of emotional insight is the ability to
recognise feelings, an element of human understanding that has been argued
to be much more important than scientific and verbal intelligence. Workers
have lost interest or concentration in their work, as well as focus and
performance in the working environment, as a result of the gradual growth
of IoT technology in smart environments, technology disruptions, and
degradation of performance in industry. Furthermore, despite the rapid rise
of IoT in the field of modern intelligent services, current IoT-based
systems notably lack cognitive intelligence, implying that they are unable
to meet the needs of industrial services.

Deep learning has become one of the most widely used approaches in a variety
of machine learning applications and studies. While it is mostly used for
content-based image retrieval, it can still be improved by using it in a
variety of computer vision applications. According to the findings of a
rigorous theoretical and practical analysis, there is an urgent need to
address this issue by developing an emotionally intelligent approach based
on machine learning (deep learning, CIoT) that will mentor and counsel
workers by monitoring their behaviour in the workplace. The goal of this
research was to develop a CNN model and controller area protocol-based
emotional intelligence system (EIS) to automatically categorise expressions
in the Facial Expression Recognition (FER2013) and Kaggle picture databases.


Chapter-5: 


According to the World Health Organization (WHO), millions of visually
impaired people around the world face significant challenges in travelling
independently. They are always in need of assistance from people who have
normal vision. For visually impaired people, finding their way to an
intended destination in a new environment is a major difficulty. This work
aims to help them overcome the difficulty of travelling to a new location
on their own. To that end, we present a way for visually impaired people to
recognise the situation and scene elements automatically in real time using
a Convolutional Neural Network (CNN). The proposed system comprises a
Raspberry Pi, ultrasonic sensors, a camera, breadboards, jumper wires, a
buzzer, and an earphone.

Breadboards and jumper wires are used to connect the sensors to the
Raspberry Pi. The sensors detect barriers and potholes, while the camera
acts as a virtual eye for visually impaired users, recognising these
obstacles from any direction (i.e., front, left, and right). The system has
a vital feature: when the user receives the scene object, the system
automatically calculates how far away the obstacles are, and a voice
message informs and guides the user through the earphone. According to the
test results, the CNN achieved an accuracy of 99.56 percent and a
validation loss of 0.0201 percent.


Chapter-6:


Internet of Things (IoT)-based technologies are used to define communication
systems based on machine-to-machine contact. When the IoT is combined with a
Convolutional Neural Network (CNN), a system that can communicate with its
environment using human speech can be created. Natural language processing
(NLP) can work in conjunction with IoT-based deep learning systems to help
with automation development. The IoT can connect a network of specialised
devices and use deep learning to extract information such as sensor
features, radio frequency features, and speech features. Speech-based
recognition systems for home automation can be developed using IoT and NLP.
Smart home applications can be integrated with voice command-based IoT
devices to transmit particular commands to those devices. Furthermore,
NLP-based IoT devices can assist people with impairments in carrying out
their regular tasks.

These devices can track users' health and make voice-activated security
changes. In addition, NLP-enabled IoT devices can aid in the automation of
environmental data collection, which includes geographic activities.
However, language understanding, accent variation, and voice variation are
all difficulties for NLP-based IoT applications. These issues limit the
efficiency and speed with which NLP-based IoT devices can be used. Deep
learning technology combined with a large vocabulary library has opened up
a wealth of possibilities for training IoT speech and command recognition
systems. CNN devices with IoT connectivity for voice recognition are a
valuable contribution to society.


Chapter-7: 


In medical treatment, the classification of ECG signals is crucial.
Traditional methods have reached the limits of their effectiveness, yet
there is still room for improvement. The main purpose of this study is to
automatically categorise and detect myocardial infarction using ECG signals.
Deep learning methods, namely Convolutional Neural Network (CNN) and Long
Short-Term Memory (LSTM) algorithms, are used in the proposed model, the
Enhanced Deep Neural Network (EDN). Vector operations such as matrix
multiplication and gradient descent are performed in parallel on large
matrices of data with GPU support. In the EDN, this parallelism reduces the
running time. The proposed EDN model has a higher accuracy (88.89%) than
prior state-of-the-art techniques on the PTB database.

The EDN is 10 times faster than the LSTM because of the speed with which it
converges during training. According to the algorithms' confusion matrices,
the EDN achieved an accuracy of 87.80%. The proposed model shows improved
performance on metrics such as precision, recall, F1 measure, and Cohen's
kappa coefficient. These improvements in the EDN's performance would help
to save human lives.
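The metrics named above can all be derived from a binary confusion matrix.
The following is a minimal sketch of those formulas; the function name
`binary_metrics` and the example counts are hypothetical and are not results
from the chapter. It assumes that no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts.

    tp, fp, fn, tn are the true-positive, false-positive, false-negative,
    and true-negative counts. Assumes none of the denominators is zero.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total

    # Cohen's kappa compares observed agreement (accuracy) with the
    # agreement expected by chance from the row and column totals.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "kappa": kappa,
    }

# Hypothetical counts for 100 test recordings.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa is useful alongside accuracy when the classes are imbalanced, because
it discounts the agreement a classifier would reach by chance alone.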


Chapter-8: 


In deep learning, image annotation is a difficult task, and without
annotation it is difficult for machines to recognise objects. The major
purpose of this chapter is to improve an automatic annotation system that
combines a pre-trained semantic segmentation model with the MATLAB Image
Labeler tool. In this chapter, we use pixel-wise semantic segmentation,
possibly the most common task in computer vision, to automatically annotate
Synthetic Aperture Radar (SAR) images of oil spills. Owing to their
excellent feature representation, deep learning-based convolutional neural
networks are currently driving tremendous breakthroughs in semantic
segmentation. To construct an automation algorithm, the proposed method
uses a pre-trained DeepLabv3+ with ResNet18 as the backbone model.
DeepLabv3+ assigns each pixel in an input image to a category.

Image Labeler is used to construct an automated algorithm for labelling oil
spill images. The chapter's originality stems from the use of pre-trained
DeepLabv3+ as the backbone for image annotation through the Image Labeler
function to increase the system's generalisation capacity. Extensive
experiments on the proposed semantic image segmentation with ResNet18 as
the backbone are conducted on the oil spill SAR dataset, and the results
are accurate when compared with the Xception and MobileNet backbone models.


Chapter-9: 


Weather is a dynamic, chaotic, and multidimensional phenomenon that occurs
in real time. Because of these qualities, forecasting and monitoring the
weather is challenging. Wireless devices are critical components not only
for process control in industry, but also in day-to-day life for monitoring
building security and movement flows, as well as for measuring common
parameters. Temperature, humidity, and pressure are the quantities to be
measured in this weather-monitoring work, and sensors have been assigned to
each of them. Data acquisition systems are well known in consumer and
modern applications.

The proposed design includes three sensors that measure the different
parameters described above, as well as a rainfall detection and wind
direction and speed estimation tool. Its readings are added to the stored
data and compared with the previous 60 years of weather data to predict
future weather using convolutional neural networks. In the past,
meteorologists used a range of methods for forecasting weather, from simple
temperature, rain, air pollution, and moisture readings to complex
computerised mathematical models. Convolutional Neural Networks (CNNs) are
a powerful data modelling technique that uses deep learning to capture and
represent complicated input/output relationships. Their real strength and
advantage is their ability to model both linear and nonlinear relationships
directly from data.

The quality and performance of these algorithms are assessed using an 
experimental approach in MATLAB 2013a. When compared to the conven- 
tional method, the use of convolutional neural networks delivered the most 
accurate weather prediction. The modelling findings, for the most part, reveal 
that reasonable forecast accuracy was achieved. 


Chapter-10: 


Higher education systems are responsible for providing efficient E-Learning
(EL) environments for online learners. In an effective learning environment,
learners are engaged in instructional activities. The chapter is divided
into three sections. The first section covers the convolutional neural
network (CNN). CNNs include a variety of models, but for this study we used
the three models that we found to be the most effective in measuring
students' SEt (SEt) in EL tasks. We used the All Convolutional network
(All-CNN), Network-in-Network (NiN-CNN), and Very Deep Convolutional network
(VD-CNN) because they have simple network architectures and are efficient
across conditions and categories. These classifications are based on
learners' facial expressions in an online setting.

The application of Machine Learning (ML) and Artificial Intelligence (AI)
in EL is covered in the second section of the chapter. It illustrates the
benefits of ML and AI in EL in general, as well as how the King Khalid
University (KKU) EL Deanship is utilising AI and ML in its operations. In
addition, the researchers focus on the future of ML and AI in any academic
programme. The third section of the chapter examines the role of the
Internet of Things (IoT) in the education sector, outlining the benefits,
types of security risks, and deployment obstacles.

The results for the three CNN models are discussed in terms of their
advantages and challenges for online learners, while the results for ML and
AI are based on a qualitative analysis of the EL tools and techniques
applied at KKU as an example; the same approach can be implemented by any
institution in its EL platform. KKU uses Learning Management Systems (LMS)
to provide online learning practices and Blackboard (BB) to share online
learning resources, so the researchers used these technologies to
illustrate the findings. According to the findings, IoT has transformed the
learning environment for both students and educators.


Chapter-11: 


Texture analysis is important in computer vision, since it is required for
both the characterization and segmentation of image regions. It is used in
a variety of technological fields, including food processing, materials
characterization, remote sensing, and medical image analysis. Over the past
decade, deep learning has redefined the state of the art in image
recognition and, by extension, in texture analysis and quantification.
Transfer learning with convolutional neural networks is driving many of the
current developments, for a variety of reasons.

This chapter discusses deep learning with convolutional neural networks and
compares texture analysis using different neural network topologies and
traditional methodologies. Three case studies are considered. In the first,
Voronoi simulations of material textures are examined, along with the
capacity of convolutional neural networks to distinguish between different
textures. In materials science, these textures are crucial in the creation
of geometallurgical and quantitative structure-property relationship
models.

In the second case study, transfer learning is used to demonstrate how
froth imaging sensors in the mineral processing industry may be
considerably improved. Breakthroughs in this field could have a direct
impact on advanced real-time control of flotation plants. In the final case
study, textures associated with stochastic signals imitating stock price
data are studied, and it is demonstrated that these methods may be used to
monitor minor changes in stock price data or, more broadly, in any other
signal.


Chapter-12: 


By allowing data to be collected through various IoT-based devices and
sensors, the Internet of Things (IoT) has revolutionised healthcare systems.
These devices generate data in a variety of formats, such as text, images,
and videos. As a result, extracting accurate and usable data from the
massive amounts of data generated by the IoT is a difficult task. The
diagnosis of various diseases using IoT data has lately emerged as a problem
that requires sophisticated and effective approaches. It is difficult to
make a good diagnosis because of the wide range of disease symptoms and
signs. Currently available methods rely on either handcrafted features or a
traditional machine learning model. This chapter therefore discusses the
application of IoT-enabled Convolutional Neural Networks (CNNs) in
healthcare diagnostics. The advantages and disadvantages of IoT-enabled
CNNs are highlighted. The chapter presents an intelligent IoT-enabled CNN
for determining a patient's health state. The CNN is used to diagnose the
ailment from data captured by IoT-based sensors. The system thus makes use
of the features of both the dataset and the CNN, ensuring high accuracy and
reliability. The proposed system illustrates real-time health monitoring,
and its performance is evaluated in terms of several metrics, such as
accuracy, recall, precision, and F1-score, in a case study on healthcare
dataset classification.


Abstract


Convolutional neural networks (CNNs) have proved to be groundbreaking in
computer vision applications, routinely outperforming standard models and
even humans in image identification challenges. CNNs are most often tested
on computer vision tasks, although their impact is far-reaching. When
deploying a CNN on resource-constrained IoT devices, there are two options:
scale a big model down or use a small model developed specifically for
resource-constrained settings. Small designs usually trade accuracy for
lower computational cost by using depth-wise convolutions rather than the
conventional convolutions used in big networks. Large models prioritize
cutting-edge performance and frequently struggle to scale down enough. The
goal of this study is to thoroughly examine and analyze the research
landscape on CNNs in the Internet of Things. A bibliometric analysis is also
carried out to identify the state of the art of CNNs in IoT. The approach
can describe publishing trends within a specific period or body of
literature by employing quantitative analysis and statistics. The study
looked at publications with
the keywords “convolutional neural network” and “IoT” or “Internet of
Things” in four major databases: IEEE Xplore, SpringerLink, ACM Digital
Library, and ScienceDirect. Publications from the Scopus database are
considered for the bibliometric analysis. A total of 1286 documents from
2015 to 2021 are used for the bibliometric analysis. The publishing patterns
and network citation analysis, in particular, shed light on the present
domains and development patterns for CNNs in IoT. The most significant and
prolific authors, organizations, countries, sources, and documents are
included in the bibliometric study. The co-occurrence of all keywords is
also examined in order to identify the most promising research topics in
the field of CNNs in IoT. The results of the study reveal that the number
of papers published each year has grown rapidly, by an average of 100-150
papers per year. China has published the largest number of documents,
followed by the United States. The College of Telecommunications and
Information Engineering, Nanjing University of Posts and
Telecommunications, Nanjing, China, has published the largest number of
documents in the domain of IoT and CNN. Among authors, Wang Y. has
published the largest number of documents in this domain. IEEE Access has
the largest number of papers in the domain of IoT and CNN. This research
will assist academics and practitioners in gaining a thorough grasp of the
current state of, and developments in, this field.


Keywords: Internet of Things, convolutional neural network, bibliometric
analysis, VOSviewer.


1.1 Introduction 


The Internet of Things (IoT) has attracted a lot of attention in recent
years. For example, an artificial-intelligence-based IoT robot might be
used in a number of surveillance applications. Convolutional neural
networks (CNNs) perform exceptionally well in a variety of machine learning
applications. As IoT devices integrated with sensors pervade every part of
contemporary life, running CNN analysis, a computationally intensive
function, on resource-constrained devices is becoming a requirement. CNNs
are widely used in image and video processing applications. Despite their
exceptional performance, CNNs are computationally demanding. There have
been several proposals for accelerating CNNs with application-specific
integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and
graphics processing units (GPUs). Owing to their high performance and ease
of use, GPUs are the primary platform for accelerating CNNs. The
advancement of IoT enables the connection of smart devices over the
Internet. Advances in IoT technology hold enormous potential for higher
quality, greater efficiency, and intelligent operations. Research on smart
IoT operation solutions is gaining popularity as IoT technology advances.
Recent studies indicate that IoT has even more potential uses in
information-intensive industrial areas.

Beyond support for inter-object communication, other expectations are
emerging for the IoT service system, including autonomous installation,
automation, and optimal operation. Although such a system enables Internet
connectivity and automation features through pre-setting, it is challenging
to sustain continuous operation and value generation in the application
area. To overcome these issues, user monitoring and action are required.
The growth and rising popularity of various computational technologies
(machine learning and deep learning) allow a wide variety of previously
unsolved smart applications and problems to be considered again. The smart
IoT service system is defined as a system that receives input from its
surroundings, uses the data gathered to detect the situation, and interacts
with the user environment using service rules and specialized knowledge.

Machine-learning-based subsystems are rapidly being deployed in IoT edge
devices, necessitating resource-efficient designs and implementations,
particularly in battery-constrained settings. Since CNNs are inherently
inexact, approximate computations can be used to reduce the required
runtime and power usage on resource-constrained IoT edge devices without
substantially affecting prediction performance. A CNN is a type of ML
technique that can tackle a wide range of problems, including visual
processing and robust feature extraction for recognition and detection.
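One common form of approximate computation is reduced-precision arithmetic.
The chapter does not say which approximation it has in mind, so the sketch
below simply assumes symmetric 8-bit quantization of a weight matrix and
measures how much a single layer's output changes. The helper names, the
random weights, and the layer size are hypothetical, and the weights are
assumed not to be all zero.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 using one shared scale (symmetric scheme)."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

# Hypothetical layer: 64 inputs, 32 outputs, random weights and input.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 32)).astype(np.float32)
inputs = rng.standard_normal((1, 64)).astype(np.float32)

q, scale = quantize_int8(weights)

exact = inputs @ weights                  # full-precision result
approx = inputs @ dequantize(q, scale)    # result with quantized weights
rel_error = np.linalg.norm(exact - approx) / np.linalg.norm(exact)

print("storage (bytes):", weights.nbytes, "->", q.nbytes)
print("relative output error:", float(rel_error))
```

The int8 copy needs a quarter of the memory of the float32 weights, and each
weight is off by at most half a quantization step, which is why the output
of the layer changes only slightly.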

This chapter is structured as follows. Section 1.2 describes the related
work, Section 1.3 presents the research questions, and Section 1.4 gives an
overview of the studies selected for review, followed by Section 1.5, which
gives an overview of bibliometric analysis. Section 1.6 explains the
methodology followed for the bibliometric analysis. Section 1.7 discusses
the limitations and future work, followed by the conclusion in Section 1.8.


1.2 Related Work 


Chen and Deng (2020) analyzed the research progress in the field of CNNs
using the bibliometric method. A simple statistical analysis and a
co-citation network were used to examine textual extracts from the CNN
literature. Their experiments demonstrate that CNNs are used in a variety
of computer vision applications, including fault diagnosis and image
identification, seismic detection, location, automatic detection of
fractures and signals, image analysis, and pattern classification.
Nakhodchi and Dehghantanha (2020) presented a detailed overview of the use
of deep learning in cybersecurity studies and in bridging research gaps.
Around 740 papers published between 2010 and 2018 in the ISI Web of Science
database were evaluated. The number of articles and the number of citations
are addressed using bibliometric analysis. The authors also provided
analyses based on countries and continents, study topics, authors,
institutions, phrases, and keywords.

Sakhnini et al. (2021) offered a bibliometric review of scholarly
publications focusing on the security concerns of IoT-aided smart grids.
All articles and journals were subjected to a bibliometric analysis, with
the results organized by chronology, author, and important topics.
Moreover, the researchers reviewed the numerous cyber-threats confronting
the smart grid, the many security techniques presented in the literature,
and the research needed in the area of intelligent grid security.
Gamboa-Rosales et al. (2020) presented a bibliometric analysis of
decision-making using the Internet of Things and machine learning.
According to their study, Hanumayamma Innovations and Technologies, Inc.,
Beijing University of Posts and Telecommunications, Vellore Institute of
Technology, Vellore, and others are among the most productive
organizations. The most prolific countries are, in order, India, the United
States of America, China, the United Kingdom, Canada, Spain, France, and
Germany. The relationship between the most prolific organizations and
countries demonstrates the diversity and high quality of the articles
covered by this study topic.


1.3 Research Questions 


The current research examines the bibliometrics of Internet of Things enabled 
convolutional neural networks. The following research questions have been 
formulated to achieve the stated goal: 

RQ1: In terms of publishing output, what are the general publishing patterns
from 2015 to October 2021?

RQ2: Which countries have the maximum number of documents and
citations in the domain of IoT-enabled CNN?

RQ3: Which organizations have the maximum number of documents in the
domain of IoT-enabled CNN?

RQ4: Who has published the maximum number of articles in the area of
IoT-enabled CNN?

RQ5: Which journals, proceedings, or book chapters have the maximum
number of articles for the topic IoT-enabled CNN?

RQ6: Which document has the maximum number of citations?

RQ7: Which keywords occur the maximum number of times and what are
the hot research topics?
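RQ7 rests on counting how often keywords occur and how often pairs of
keywords appear together in the same document. A minimal sketch of that
counting step follows. The records are hypothetical, not the Scopus data
used in the study, and the chapter's own keyword list names VOSviewer, so
the study presumably relied on that tool rather than on code like this.

```python
from collections import Counter
from itertools import combinations

def keyword_counts(records):
    """Count keyword occurrences and pairwise co-occurrences.

    records is a list of keyword lists, one list per document. Keywords are
    lower-cased and de-duplicated per document before counting.
    """
    occurrences = Counter()
    cooccurrences = Counter()
    for keywords in records:
        unique = sorted({k.strip().lower() for k in keywords})
        occurrences.update(unique)
        # Sorting first gives every pair one canonical (a, b) ordering.
        cooccurrences.update(combinations(unique, 2))
    return occurrences, cooccurrences

# Hypothetical author-keyword lists for three documents.
records = [
    ["Internet of Things", "Convolutional Neural Network", "Edge Computing"],
    ["Convolutional Neural Network", "Internet of Things"],
    ["Deep Learning", "Internet of Things"],
]

occ, co = keyword_counts(records)
print(occ.most_common(1))  # [('internet of things', 3)]
print(co[("convolutional neural network", "internet of things")])  # 2
```

The most frequent keywords answer the first half of RQ7, while the most
frequent pairs hint at which topics are studied together.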


1.4 Literature Review 


Field monitoring is a critical function in the IoT, in which several IoT nodes
collect data and transmit it to a ground unit or the cloud for computation,
reasoning, and evaluation. When the observations are high-dimensional,
this transfer becomes costly. IoT networks with limited bandwidth and
low-power devices struggle to handle such frequent, high-volume transmissions.

The combination of IoT and CNN technologies has been applied in several
fields. Some of them are listed below.

Shin et al. (2019) presented an IoT solution equipped with a smart
surveillance robot that employs machine learning to overcome the constraints
of traditional CCTV. Users are provided with an app that lets them control
the robot directly. CNN-based machine learning is advantageous for image
context analysis, where it is used to identify and classify pictures or
patterns accurately and where high recognition performance can be expected.

Njima et al. (2019) created a CNN-based localization architecture that moves
the complexity of online prediction to an offline pre-processing phase. Its
goal is to locate a sensor node precisely by estimating its position.
Received Signal Strength Indicator (RSSI) fingerprints are used to create
3D radio images. The results validate the choice of parameter settings,
optimization methods, and model architectures.

Song et al. (2021) proposed FDA3, an effective federated defense technique
that can aggregate defensive knowledge against adversarial examples from
several sources. Their proposed cloud-based architecture, inspired by
federated learning, allows IIoT devices to share defense capabilities against
various attacks.

Apart from the above-mentioned domains of defense and malware
classification, there are various other domains like healthcare (Niu et al.,
2021), image classification (Li et al., n.d.), agriculture (Shylaja et al.,
2021), etc., in which the integration of IoT and CNN plays a vital role.

Niu et al. (2021) developed an approach for medical image classification
based on Distant Domain Transfer Learning (DDTL). They built a DDTL model
for COVID-19 diagnosis using the unlabeled Office-31, Caltech-256, and chest
X-ray image datasets as source data and a small batch of COVID-19 lung CT
images as the target domain. CNN, AlexNet, and ResNet were chosen as
non-transfer supervised baseline models.

Li et al. (n.d.) constructed a deep-learning-based IoT image identification 
system that uses CNN to build image processing algorithms and PCA and 
LDA to identify picture features. 

Shylaja et al. (2021) proposed a new machine-learning-enabled smart IoT
device to help the agriculture sector. They developed an Intelligent Crop
Monitoring Device (ICMD) to analyze crops in the crop land around the clock,
seven days a week. This type of monitoring device improves agricultural
productivity and service quality, as well as associated products.


1.5 Overview of Bibliometric Analysis 


Bibliometric analysis is the statistical examination of published scientific
articles, books, or book chapters, and it is an effective way of quantifying
the influence of publications in the research world (de Moya-Anegón et al.,
2007). The number of times a piece of research has been cited by other
authors may be used to gauge its academic influence. A bibliometric analysis
or citation classics research design is a popular method for determining an
item’s impact.

A bibliometric analysis is a statistical examination of books, papers, or
other printed materials. In recent years, bibliometric analysis has been used
to describe the state, features, history, and growth patterns of knowledge
in a range of professional disciplines. This can help interested academics
who lack expertise in a discipline gain a complete overview of it. From
a micro to a macro standpoint, bibliometrics can summarize a substantial
amount of academic research. Bibliometric analysis is best suited to cases
where the scope of review is broad and the dataset is large (Donthu et al.,
2021). Bibliometric techniques may also be used to examine the performance
of several disciplines. Accordingly, VOSviewer clusters are employed in this
work to conduct a thorough and systematic evaluation of IoT-enabled CNN
research. VOSviewer is a software tool for creating and viewing bibliometric
maps (Van Eck & Waltman, 2009). It is freely available to the bibliometric
research community. VOSviewer may be used to generate co-citation maps for
authors or journals, as well as keyword co-occurrence maps.

Scholars employ bibliometric analysis to achieve a range of objectives,
including detecting emerging trends in article and journal performance,
collaboration patterns, and research constituents, as well as studying the
intellectual structure of a certain field in the existing literature. The
data handled in bibliometric analysis is generally very large, but its
interpretation commonly relies on both objective and subjective assessment
grounded in informed methods and practices. In other words, bibliometric
analysis can be used to systematically make sense of huge volumes of
unstructured information in order to grasp and map the cumulative scientific
knowledge and the evolutionary nuances of a field. As a consequence,
well-conducted bibliometric studies can build a firm base for advancing a
field in novel and constructive ways, enabling researchers to (1) obtain
a one-stop overview, (2) recognize knowledge gaps, (3) develop new research
ideas, and (4) position their intended contributions to the field.

From the preceding discussion, it can be concluded that bibliometric studies
are important for presenting trends in a field of study and that they are
conducted on a regular basis to report on the top authors, publication
venues, top organizations and countries, citation landscape, and other
related information, as well as research trends in the domain under study.
Accordingly, the goal of this research is to give a full analysis of
IoT-enabled CNN through bibliometric analysis of the relevant literature.


1.6 Methodology for Bibliometric Analysis 
1.6.1 Database Collection 


Web of Science (https://www.webofknowledge.com/), Scopus
(https://www.scopus.com/), Google Scholar (https://scholar.google.com/), and
Scimago (https://www.scimagojr.com/) are all well-known databases that
contain a diverse spectrum of academic articles. We searched the bibliometric
literature on IoT-enabled CNN to locate the best digital library. Clarivate
Analytics Web of Science and Scopus are the most often used databases for
analyzing and acquiring citation data from the literature. For the
bibliometric survey, this study relied on Scopus, the most well-known
database.


1.6.2 Methods for Data Extraction 


A total of 1286 papers were collected from the Scopus database, together with
their publishing metadata. The data was exported in CSV format from Scopus
and loaded into VOSviewer for bibliometric analysis.
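
As a rough illustration of this extraction step, the following sketch reads a CSV export and keeps only the records that match the inclusion criteria used in this study (document type, language, and year range). It is a minimal sketch, not the procedure actually followed: the sample rows are invented, the column names (`Title`, `Year`, `Document Type`, `Language of Original Document`, `Cited by`) are assumptions that should be checked against the header of the real export, and `load_records` is a hypothetical helper.

```python
import csv
import io

# Hypothetical four-row excerpt standing in for a Scopus CSV export.
# The column names are assumed; check them against the real file header.
SAMPLE_EXPORT = """Title,Year,Document Type,Language of Original Document,Cited by
Paper A,2019,Article,English,12
Paper B,2020,Conference Paper,English,
Paper C,2021,Review,English,40
Paper D,2022,Article,English,3
"""

KEPT_TYPES = {"Article", "Conference Paper", "Book Chapter"}


def load_records(handle):
    """Read export rows, keeping only the document types, language, and years of interest."""
    records = []
    for row in csv.DictReader(handle):
        year = int(row["Year"])
        if row["Document Type"] not in KEPT_TYPES:
            continue  # reviews and other document types are excluded
        if not 2015 <= year <= 2021:
            continue  # outside the study period
        if row["Language of Original Document"] != "English":
            continue
        # An empty "Cited by" cell is treated as zero citations.
        cited = int(row["Cited by"]) if row["Cited by"].strip() else 0
        records.append({"title": row["Title"], "year": year, "citations": cited})
    return records


records = load_records(io.StringIO(SAMPLE_EXPORT))
print(len(records), "records kept")
```

With a real export, the `io.StringIO` object would be replaced by an opened file; the filtered records could then be counted per year, country, or source before being mapped in VOSviewer.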

The following query was used to search Scopus for documents:


Search Parameters: TITLE-ABS-KEY (internet AND of AND things AND
convolutional AND neural AND networks) AND (LIMIT-TO (DOCTYPE, “cp”) OR
LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “ch”)) AND (EXCLUDE
(PUBYEAR, 2022)) AND (LIMIT-TO (PUBSTAGE, “final”)) AND (LIMIT-TO
(LANGUAGE, “English”)).
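
To make the structure of the search string easier to see, the sketch below assembles it from its parts: the keyword clause, the document-type filter (joined with OR), and the year, publication-stage, and language filters (joined with AND). This is only string handling under the assumption that the clauses nest as shown; `limit_to` and `build_query` are hypothetical helper names, not part of any Scopus tool, and the resulting string would still be pasted into the Scopus advanced search manually.

```python
def limit_to(field, values):
    """Join the LIMIT-TO clauses for one field with OR and wrap them in parentheses."""
    clauses = [f'LIMIT-TO ({field}, "{value}")' for value in values]
    return "(" + " OR ".join(clauses) + ")"


def build_query(terms, doc_types, excluded_year, stage, language):
    """Assemble the search string from its parts, joining the parts with AND."""
    parts = [
        "TITLE-ABS-KEY (" + " AND ".join(terms) + ")",
        limit_to("DOCTYPE", doc_types),
        f"(EXCLUDE (PUBYEAR, {excluded_year}))",
        limit_to("PUBSTAGE", [stage]),
        limit_to("LANGUAGE", [language]),
    ]
    return " AND ".join(parts)


query = build_query(
    terms=["internet", "of", "things", "convolutional", "neural", "networks"],
    doc_types=["cp", "ar", "ch"],  # conference paper, article, book chapter
    excluded_year=2022,
    stage="final",
    language="English",
)
print(query)
```

Building the string this way keeps the parentheses balanced, which matters because the three document-type clauses must be grouped together before they are combined with the other filters.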

The inquiry is limited to articles, book chapters, and conference papers. 
Review papers are not included since they summarize carefully chosen 
scientific knowledge (Ellegaard & Wallin, 2015). This information is often 
transmitted in the literature, along with a complete bibliography for the 
subject. Bibliometric research, on the other hand, concentrates on statistical 
data and is almost never used in combination with a field bibliography. 

The keywords are important since they provide context for the study
topic. Accordingly, the keywords “Internet of Things” and “convolutional
neural networks” are used to retrieve documents from the Scopus database.
The documents are filtered by year, language, and document type. Papers
published in journals, conferences, and book chapters in English from 2015
to October 2021 were used for this investigation.


1.6.3 Year-Wise Publications 


The yearly pattern of publication activity allows for an examination of the
stage of growth, knowledge gathering, and maturity of IoT-enabled CNN.
Because there are relatively few publications before 2015, only papers
from 2015 to 2021 are evaluated. The number of publications concerning
IoT-enabled CNN increased from 7 in 2016 to 28 in 2017, 123 in 2018, 303 in
2019, and 439 in 2020, with 384 recorded in 2021 through October 31st, 2021.
Figure 1.1 depicts a graph of publications in the field of IoT-enabled CNN.

Figure 1.1 Year-wise publications.
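
The yearly counts quoted above can be turned into year-over-year growth rates with a few lines of code. The sketch below uses only the figures reported in the text; the growth rate itself is an illustrative derived quantity that the chapter does not report, and the 2021 value is not directly comparable because that year covers January to October only.

```python
# Publication counts per year as reported in the text (2021 covers January to October only).
counts = {2016: 7, 2017: 28, 2018: 123, 2019: 303, 2020: 439, 2021: 384}

years = sorted(counts)
growth = {}
for previous, current in zip(years, years[1:]):
    # Relative change with respect to the previous year, in percent.
    growth[current] = 100.0 * (counts[current] - counts[previous]) / counts[previous]

for year, pct in growth.items():
    print(f"{year}: {pct:+.1f}%")
```

Read this way, the series shows very fast relative growth in the early years that slows as the absolute number of publications rises.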


1.6.4 Network Analysis of Citations 


1.6.4.1 Citation analysis of countries 

Of 171 nations, 25 satisfied the citation criteria of a minimum of 5 papers
and a minimum of 100 citations per nation. China has the largest number of
documents, 473, with 2614 total citations, followed by the United States with
233 documents and 2115 citations and India with 202 documents and 769
citations. The citation network analysis for countries is shown in
Figure 1.2.

Table 1.1 shows the top 25 countries with maximum documents.
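
The thresholding described above (at least 5 documents and at least 100 citations per country) amounts to a simple filter followed by a ranking. The sketch below applies it to a handful of rows copied from Table 1.1 plus one invented row, "Examplia", that exists only to show a country failing the thresholds. The citations-per-document ratio printed at the end is an extra illustrative measure, not one reported in this chapter.

```python
from typing import NamedTuple


class Country(NamedTuple):
    name: str
    documents: int
    citations: int


# A few rows copied from Table 1.1, plus one invented row ("Examplia")
# that is only there to show a country failing the thresholds.
rows = [
    Country("China", 473, 2614),
    Country("United States", 233, 2115),
    Country("India", 202, 769),
    Country("United Kingdom", 65, 1082),
    Country("Algeria", 6, 138),
    Country("Examplia", 4, 250),  # hypothetical: too few documents
]

MIN_DOCUMENTS = 5
MIN_CITATIONS = 100

selected = [
    c for c in rows
    if c.documents >= MIN_DOCUMENTS and c.citations >= MIN_CITATIONS
]

# Rank by document count and show citations per document as a rough impact ratio.
for c in sorted(selected, key=lambda c: c.documents, reverse=True):
    print(f"{c.name:15s} {c.documents:4d} {c.citations:5d} {c.citations / c.documents:6.2f}")
```

The same filter, with different threshold values, corresponds to the selections of organizations, authors, and sources in the following subsections.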


Figure 1.2 Country-specific citation network analysis.


Table 1.1 Top 25 countries in terms of documents

Country           Documents   Citations   Total link strength
China                   473        2614                   143
United States           233        2115                    70
India                   202         769                    38
South Korea              71         448                    43
United Kingdom           65        1082                    43
Saudi Arabia             53         300                    26
Taiwan                   52         327                    18
Canada                   45         407                    17
Australia                44         356                    12
Italy                    44         576                    37
Japan                    37         408                    20
Singapore                32         232                    10
Brazil                   29         291                    38
Pakistan                 29         434                    27
Egypt                    26         175
France                   25         192                     8
Spain                    23         433                    20
Bangladesh               19         136                     0
Qatar                    19         106                     2
Turkey                   13         135                    14
Hong Kong                11         122                     9
Switzerland              11         318                    19
Austria                   9         209                     7
Belgium                   9         183                     0
Algeria                   6         138                    10


1.6.4.2 Citation analysis of organizations

With a minimum of 2 documents per organization and a minimum of 50
citations per organization as a threshold, 18 organizations out of 2797
satisfied the requirement. College of Telecommunications and Information
Engineering, Nanjing University of Posts and Telecommunications, Nanjing,
China, has six documents. Citation network analysis in terms of organizations
is represented in Figure 1.3.

Table 1.2 shows the top 18 organizations in terms of number of documents
published.


Figure 1.3 Citation network analysis in terms of organizations.


Table 1.2 Top organizations

Organization                                  Documents   Citations   Total link strength
College of Telecommunications and                     6          78                     0
  Information Engineering, Nanjing
  University of Posts and
  Telecommunications, Nanjing 210003, China
Department of Electrical Engineering and              5         113                     0
  Computer Sciences, University of
  California, Berkeley, Berkeley, CA,
  United States
Department of Information Systems and                 4          54                     0
  Cyber Security, University of Texas at
  San Antonio, San Antonio, TX 78249,
  United States
Department of Software, Sejong University,            4         149                     5
  Seoul 143-747, South Korea
Department of Software Engineering,                   3          59                     0
  College of Computer and Information
  Sciences, King Saud University, Riyadh
  11543, Saudi Arabia
Institute of Microelectronics, Tsinghua               3          56                     2
  University, Beijing 100084, China
Intelligent Media Laboratory, Digital                 3         149                     3
  Contents Research Institute, Sejong
  University, Seoul 143-747, South Korea
School of Electrical and Electronics                  3          89                     0
  Engineering, Nanyang Technological
  University, Singapore
Department of Computer Engineering,                   2          59                     0
  College of Computer and Information
  Sciences, King Saud University, Riyadh
  11543, Saudi Arabia
Department of Computer Science,                       2         138                     0
  Guelma University, Guelma 24000, Algeria
Department of Electrical Engineering and              2          92                     0
  Computer Science, Syracuse University,
  Syracuse, NY, United States
Department of Electrical Engineering,                 2          92                     0
  City University of New York, City
  College, NY, United States
Department of Electrical Engineering,                 2          55                     0
  University of Southern California, Los
  Angeles, CA 90089, United States
Department of Electrical Engineering,                 2          92                     0
  University of Southern California, Los
  Angeles, CA, United States
Graduate Program in Applied Informatics,              2          74                     2
  Universidade De Fortaleza, Fortaleza
  60811-905, Brazil
Integrated Systems Laboratory, ETH Zurich,            2         113                     2
  Zürich 8092, Switzerland
School of Computer Science, South China               2          54                     0
  Normal University, Guangzhou, China
University of Washington, Seattle, WA,                2         191                     0
  United States


1.6.4.3 Citation analysis of authors

With a minimum of 5 documents and a minimum of 100 citations per author,
28 of 3874 authors meet the thresholds. Wang Y. has the largest number of
documents, 39, with 409 total citations, followed by Li J. with 25 documents
and 250 citations. Citation network analysis in terms of authors is
represented in Figure 1.4.

Table 1.3 shows the top authors based on the number of documents
published.


Figure 1.4 Citation network analysis in terms of authors.


Table 1.3 Top authors based on the number of documents published

Author                  Documents   Citations   Total link strength
Wang Y.                        39         409                    19
Li J.                          25         250                     4
Li X.                          21         149                     7
Zhang Y.                       21         135                     5
Li Y.                          20         204                     4
Yang J.                        18         219                    12
Zhao Y.                        16         368                     8
Zhang H.                       13         102
Zhou Y.                        13         149                     0
Benini L.                      12         315                    19
De Albuquerque V.H.C.          12         188                    11
Li Z.                          12         177                     0
Wang C.                        11         104                     3
Chen Z.                        10         243                     1
Li W.                          10         127                     0
Yuan B.                         8         165                     0
Conti F.                        7         202                    10
Zhang Q.                        7         153                     1
Muhammad K.                     6         171                     9
Ren A.                          6         164                     0
Rossi D.                        6         243                    18
Ding C.                         5         119                     0
Gui G.                          5         122                    10
Kumar N.                        5         131                     0
Li L.                           5         174                     0
Qiu Q.                          5         153                     0
Spanos C.J.                     5         113                     0
Zou H.                          5         113                     0


1.6.4.4 Source citation analysis

The source citation analysis is performed using a threshold of 3 documents
per source and a minimum of 50 citations per source. Only 18 of the
556 sources fulfill the criteria. IEEE Access has the largest number of
documents, 97, with 937 total citations, followed by IEEE Internet of Things
Journal with 84 documents and 684 total citations. Citation network analysis
in terms of sources is shown in Figure 1.5.

Figure 1.5 Citation network analysis in terms of sources.

Table 1.4 shows the top sources on the basis of the published documents.


Table 1.4 Top sources that publish most documents

Source                                        Documents   Citations   Total link strength
IEEE Access                                          97         937                    16
IEEE Internet of Things Journal                      84         684                    11
ACM International Conference Proceeding              47          57                     1
  Series
Sensors (Switzerland)                                22         236                     3
IEEE Transactions on Industrial Informatics          19         503                     6
2020 IEEE International Conference on                17          59                     0
  Informatics, IoT, and Enabling
  Technologies, ICIOT 2020
Procedia Computer Science                            15         161                     0
Wireless Communications and Mobile                   10          93                     0
  Computing
Computer Communications                               8          82                     2
Future Generation Computer Systems                    8          62                     3
IEEE International Conference on                      8          75                     1
  Communications
IEEE Sensors Journal                                  8          55                     0
IEEE Transactions on Computer-Aided Design            7         220                     3
  of Integrated Circuits and Systems
IEEE Transactions on Circuits and Systems             5         203                     3
  I: Regular Papers
Security and Communication Networks                   5          50                     0
Computer Networks                                     4         102                     1
IEEE Journal of Biomedical and Health                 4         198                     2
  Informatics
IEEE Transactions on Vehicular Technology             4         113                     4


1.6.4.5 Citation analysis of documents

The citation analysis of documents uses a threshold of 70 citations per
document. Only 20 papers out of 1348 met this threshold. DeepSense: A
Unified Deep Learning Framework for Time-Series Mobile Sensing Data
Processing (Yao et al., 2017) has the most citations, 218. Citation network
analysis in terms of documents is shown in Figure 1.6.

Table 1.5 shows the author names of the documents with maximum
citations.


Figure 1.6 Document-based citation network analysis.


Table 1.5 Top documents based on the citations

Document                Citations   Links
Yao (2017)                    218       0
Lopez-Martin (2017)           205       0
Ravi (2017)                   197       0
Iyer (2016)                   174       0
Li (2018)                     165       0
Ferrag (2020)                 137       0
Li (2018)                     126       0
Lane (2015)                   122       0
Andri (2018)                   96       0
Chen (2019)                    95       0
Garg (2019)                    93       0
Núñez-Marcos (2017)            91       0
Zheng (2019)                   87       0
Du (2018)                      86       0
Zhao (2018)                    82       0
Masood (2018)                  79       0
Mahmud (2018)                  76       0
Bianchi (2019)                 75       0
Muhammad (2019)                75       0
Conti (2017)                   70       0


1.6.5 Co-occurrence Analysis for Keywords/Hot Research Areas


Since keyword frequency analysis can quickly and effectively characterize
the fields of study and the core content of a topic, this chapter builds
keyword co-occurrence networks to map the areas of IoT-enabled CNN
research.


1.6.5.1 Co-occurrence for all keywords

In this analysis, all keywords are considered. To be included, a keyword
must have at least five occurrences. Only 687 of the 9624 keywords met this
threshold. The keyword Internet of Things appears 1071 times with a total
link strength of 11,141. Figure 1.7 depicts a network analysis of
co-occurrences for all keywords. The parameter settings are as follows: the
type of analysis is co-occurrence; the counting method is full counting; and
the unit of analysis is all keywords. The top 10 keywords of IoT-enabled CNN
research on the basis of the frequency are shown in Table 1.6.

Figure 1.7 Network analysis of co-occurrences for all keywords.


Table 1.6 Top 10 keywords of IoT-enabled CNN research on the basis of the
frequency

Keyword                               Occurrences   Total link strength
Internet of Things                           1071                11,141
Convolutional neural networks                 796                  8088
Deep learning                                 626                  6781
Convolutional neural network                  553                  5812
Convolution                                   541                  5997
Neural networks                               376                  4049
Internet of Things (IoT)                      336                  3765
Deep neural networks                          309                  3494
Learning systems                              239                  2895


The density visualization map based on occurrence weights, shown in
Figure 1.8, reveals the hot research themes.

The size of the circle represents the weight of the keyword link in the
overlay visualization, and the gradient color from blue to yellow shows the
average citation score of a term.

Figure 1.8 Density visualization map for all keywords.


1.6.5.2 Co-occurrence for author keywords

With a minimum of 5 occurrences per keyword as the threshold, 130 of 3273
author keywords met the threshold. Network analysis of co-occurrences in
terms of author keywords is shown in Figure 1.9.

Figure 1.9 Network analysis of co-occurrences for author keywords.

The top 10 author keywords of IoT-enabled CNN research on the basis of
the frequency are shown in Table 1.7.

The density visualization map based on author keyword occurrence weights,
shown in Figure 1.10, reveals the hot research topics.
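
The keyword analyses in this section rest on three quantities: the number of occurrences of a keyword, the strength of the link between two keywords, and the total link strength of a keyword. The sketch below shows one plausible way to compute them with full counting and a minimum-occurrence threshold. It is an illustration of the idea only and is not VOSviewer's implementation: the keyword lists are invented, and the threshold is lowered from 5 to 2 so that the toy data produces a result.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per document; real input would come from
# the author-keyword or index-keyword fields of the exported records.
documents = [
    ["internet of things", "deep learning", "convolutional neural network"],
    ["internet of things", "deep learning"],
    ["internet of things", "edge computing", "deep learning"],
    ["convolutional neural network", "edge computing"],
]

MIN_OCCURRENCES = 2  # the chapter uses 5; lowered here for the toy data

# Occurrences: number of documents in which a keyword appears.
occurrences = Counter(kw for doc in documents for kw in set(doc))
kept = {kw for kw, n in occurrences.items() if n >= MIN_OCCURRENCES}

# Full counting: each document in which two kept keywords co-occur adds 1 to their link.
links = Counter()
for doc in documents:
    for a, b in combinations(sorted(set(doc) & kept), 2):
        links[(a, b)] += 1

# Total link strength of a keyword: sum of the strengths of all its links.
total_link_strength = Counter()
for (a, b), strength in links.items():
    total_link_strength[a] += strength
    total_link_strength[b] += strength

for kw in sorted(kept, key=lambda k: -occurrences[k]):
    print(kw, occurrences[kw], total_link_strength[kw])
```

Under full counting every co-occurrence carries the same weight, so keywords that appear in many documents with many other keywords accumulate a high total link strength, which is the pattern visible for Internet of Things in the tables of this section.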


Table 1.7 Top 10 author keywords of IoT-enabled CNN research on the basis of the
frequency

Keyword                               Occurrences   Total link strength
Deep learning                                 385                   752
Convolutional neural network                  265                   444
Internet Of Things                            224                   493
IoT                                           123                   255
Convolutional neural networks                 121                   219
CNN                                           112                   216
Machine learning                               95                   218
Convolutional neural network (CNN)             85                   136
Internet Of Things (IoT)                       85                   168
Edge computing                                 46                   120


Figure 1.10 Density visualization map for author keywords.


1.6.5.3 Co-occurrence for index keywords

With a minimum of 5 occurrences per keyword as the threshold, 595 of 7367
index keywords met the criteria. Network analysis of co-occurrences for
index keywords is shown in Figure 1.11.

The top 10 index keywords of IoT-enabled CNN research on the basis of
the occurrences are shown in Table 1.8.

As shown in Figure 1.12, the density visualization map based on index
keyword occurrence weights highlights the hot research topics.

Figure 1.11 Network analysis of co-occurrences for index keywords.


Table 1.8 Top 10 index keywords of IoT-enabled CNN research on the basis of the frequency

Keyword                               Occurrences   Total link strength
Internet of Things                           1013                  9431
Convolutional neural networks                 729                  6383
Convolution                                   538                  5331
Deep learning                                 530                  5359
Convolutional neural network                  377                  3858
Neural networks                               355                  3552
Deep neural networks                          305                  3095
Internet of Things (IoT)                      293                  2942
Learning systems                              238                  2524
Classification (of information)               144                  1521


Figure 1.12 Density visualization map for index keywords.


1.7 Limitations and Future Work


• Computational Complexity and Resource Availability: Without a doubt,
the benefits of deep neural networks (DNNs) and IoT complement each
other. However, many researchers are still working on the problem of
properly combining DNNs with IoT. Cloud computing provides intrinsic
benefits such as virtualization, large-scale integration, high
dependability, high scalability, and low cost. Because of the large size of
DNN models and their computational complexity, computing the inference
results on devices with minimal resources is problematic. The obvious
answer is to deploy the DNN model on a server in a cloud data center.
However, with so many computing jobs being done in the cloud, the data
that must be transported is huge in size and quantity, putting significant
strain on the network bandwidth and computational power of the cloud
computing infrastructure.


• Network Overhead: Furthermore, practically all IoT applications
demand ultra-low power consumption, limited storage space use, and
real-time data processing, particularly those that are sensitive to delay
or are highly interactive. The large volume of data transfer has
significantly increased the strain on the backbone network, resulting in
massive expansion and maintenance expenditures for service providers.
Traditional approaches for DNN computations rely on massive clouds
to meet the exorbitant resource needs of DNNs. However, this strategy may
result in significant delays and wasted energy.

• Data Privacy: In practical use, IoT applications have critical real-time
needs as well as user data privacy concerns. Deploying DNNs at edge
computing nodes rather than faraway cloud servers is more efficient and
safer.

• Sensor Deployment: The performance of deep learning methods depends
on the data sources. Even if the model’s architecture is well designed,
a deep model cannot perform well without sufficient clean data. As a
result, determining how to deploy data-gathering devices is a significant
research subject. The number of sensors employed and how well they are
distributed affect the reliability of the information collected. The
information contained in the data is ultimately the key to resolving
issues. A data collection module must therefore be built for the whole
workflow of an IoT application. For example, Li et al. (2016) purposefully
built a photo collection module, DeepCham, to increase the model’s
recognition accuracy, incorporating the concept of crowdsourcing within
the data collection module. The creation of practical DL-based IoT
applications needs a cost-effective, dependable, and credible paradigm
for data collection.

• Performance Degradation: Training a deep network involves
time-consuming procedures. The depth of a deep learning network affects
its ability to extract crucial features. However, the vanishing gradient
problem emerges as models grow deeper, degrading performance. To address
this, Hinton and Salakhutdinov (2006) suggest stacking RBMs as a method for
pre-training models. Furthermore, replacing the sigmoid function with the
ReLU function helps to alleviate the vanishing gradient problem. Another
major issue when training deep models is poor generalization. The main
remedy is to add more data or to reduce the number of parameters. One
effective way to reduce the number of parameters is to use convolutional
kernels, and applying dropout (Srivastava et al., 2014) is another option.
Furthermore, CNNs have seen major breakthroughs in recent years, and the
number of layers in CNN architectures has grown from 5 to over 200.
Techniques used in these standard CNNs (such as smaller convolutional
kernels or batch normalization) could be useful when applying deep
learning methods to problems in the wireless communication field.

• Resource-Constrained Embedded System: Deep learning is a powerful
approach for analyzing vast volumes of data, but it requires substantial
hardware resources. It is still challenging to run a deep model on a
resource-constrained embedded system. To date, two lines of research
have tried to solve this challenge. The first treats end devices (such as
a smartphone) as data collectors and sends all data to capable servers
for analysis. With this method, however, data exposure, network failure,
and other difficulties may occur. The second reduces the complexity of
the network, at the cost of some performance, so that some training can
be performed on end devices.
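
The Performance Degradation item above notes that replacing the sigmoid with ReLU helps to alleviate the vanishing gradient problem. The sketch below gives a numerical feel for why, under a deliberately simplified assumption: a chain of scalar layers with unit weights, so that the backpropagated gradient is just the product of the local activation derivatives. The function names are hypothetical and the setting is far simpler than a real CNN.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never exceeds 0.25 (reached at x = 0)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn, pre_activation, depth):
    """Product of `depth` identical local derivatives, as in backpropagation
    through a chain of scalar layers with unit weights (a simplification)."""
    result = 1.0
    for _ in range(depth):
        result *= grad_fn(pre_activation)
    return result


for depth in (5, 20, 50):
    sig = chained_gradient(sigmoid_grad, 0.0, depth)
    rel = chained_gradient(relu_grad, 1.0, depth)
    print(f"depth={depth:2d}  sigmoid={sig:.3e}  relu={rel:.1f}")
```

Because the sigmoid derivative is at most 0.25, the product shrinks at least geometrically with depth, whereas the ReLU derivative is 1 for active units and leaves the gradient magnitude unchanged along the chain.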
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1.8 Conclusion 


Convolutional neural networks (CNNs) have proven groundbreaking in
computer vision applications, regularly outperforming standard models and
even humans in image identification challenges. CNNs are widely evaluated
on computer vision applications, despite their far-reaching impacts. When
implementing a CNN model in a resource-constrained context, such as IoT
devices or cell phones, striking a compromise between predictive performance
and processing cost is critical to ensuring the model functions appropriately.
The Internet of Things (IoT) has a huge influence on society and the economy
as a revolutionary network system that combines computing, control, and
communication technology. IoT and CNN integration has been widely used
in medicine, intelligent transportation, smart homes, and other industries.
CNNs are likely to play a role in drawing important conclusions from the
vast volume of IoT data, hence assisting in the development of smarter IoT.
CNNs excel in a wide range of machine learning and deep learning
applications. This chapter presented a visual and bibliometric analysis of
current IoT-enabled CNN research. The study examined 1286 valid
IoT-enabled CNN documents from the Scopus database from 2015 to October
2021. Since 2016, the number of publications on IoT-enabled CNN has grown
rapidly. The research hotspots identified with VOSviewer provide thorough
information on the literature. Based on the high-frequency keywords,
theoretical knowledge and hot research topics on IoT-enabled CNN are
largely distributed across (1) Internet of Things, (2) convolutional neural
networks, and (3) deep learning.


References 


Chen, H., & Deng, Z. (2020). Bibliometric analysis of the application
of convolutional neural network in computer vision. IEEE Access, 8,
155417-155428. https://doi.org/10.1109/ACCESS.2020.3019336

Donthu, N., Kumar, S., Mukherjee, D., Pandey, N., & Lim, W. M. (2021).
How to conduct a bibliometric analysis: An overview and guidelines.
Journal of Business Research, 133, 285-296.
https://doi.org/10.1016/J.JBUSRES.2021.04.070

Ellegaard, O., & Wallin, J. A. (2015). The bibliometric analysis of scholarly
production: How great is the impact? Scientometrics, 105(3), 1809-1831.
https://doi.org/10.1007/S11192-015-1645-Z




Gamboa-Rosales, N. K., Castorena-Robles, A., Casas-Valadez, M. A., Cobo,
M. J., Castaneda-Miranda, R., & Lopez-Robles, J. R. (2020). Decision
making using Internet of Things and machine learning: A bibliometric
approach to tracking main research themes. 2020 International Conference
on Data Analytics for Business and Industry: Way Towards a Sustainable
Economy, ICDABI 2020. https://doi.org/10.1109/ICDABI51230.2020.9325656

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality
of data with neural networks. Science, 313(5786), 504-507.
https://doi.org/10.1126/SCIENCE.1127647

Li, D., Salonidis, T., Desai, N. V., & Chuah, M. C. (2016). DeepCham:
Collaborative edge-mediated adaptive deep learning for mobile object
recognition. Proceedings - 1st IEEE/ACM Symposium on Edge Computing,
SEC 2016, 64-76. https://doi.org/10.1109/SEC.2016.38

Li, J., Li, X., & Ning, Y. (n.d.). Deep Learning Based Image Recognition for
5G Smart IoT Applications.

de Moya-Anegón, F., Chinchilla-Rodriguez, Z., Vargas-Quesada, B., Corera-
Alvarez, E., Muñoz-Fernández, F., González-Molina, A., & Herrero-
Solana, V. (2007). Coverage analysis of Scopus: A journal metric approach.
Scientometrics, 73(1), 53-78. https://doi.org/10.1007/S11192-007-1681-4

Nakhodchi, S., & Dehghantanha, A. (2020). A bibliometric analysis on the
application of deep learning in cybersecurity. Security of Cyber-Physical
Systems, 203-221. https://doi.org/10.1007/978-3-030-45541-5_11

Niu, S., Liu, M., Liu, Y., Wang, J., & Song, H. (2021). Distant domain transfer
learning for medical imaging. IEEE Journal of Biomedical and Health
Informatics, 25(10), 3784-3793. https://doi.org/10.1109/JBHI.2021.3051470

Njima, W., Njima, W., Zayani, R., Terre, M., & Bouallegue, R. (2019). Deep
CNN for indoor localization in IoT-sensor systems. Sensors, 19(14), 3127.
https://doi.org/10.3390/S19143127

Sakhnini, J., Karimipour, H., Dehghantanha, A., Parizi, R. M., & Srivastava,
G. (2021). Security aspects of Internet of Things aided smart grids: A
bibliometric survey. Internet of Things, 14, 100111.
https://doi.org/10.1016/J.IOT.2019.100111

Shin, P. W., Kim, B., & Hwang, S. (2019). An IoT platform with monitoring
robot applying CNN-based context-aware learning. Sensors, 19(11), 2525.
https://doi.org/10.3390/S19112525

Shylaja, S. L., Fairooz, S., Venkatesh, J., Sunitha, D., Rao, R. P., & Prabhu,
M. R. (2021). IoT based crop monitoring scheme using smart device with
machine learning methodology. Journal of Physics: Conference Series,
2027(1). https://doi.org/10.1088/1742-6596/2027/1/012019

Song, Y., Liu, T., Wei, T., Wang, X., Tao, Z., & Chen, M. (2021). FDA3:
Federated defense against adversarial attacks for cloud-based IIoT
applications. IEEE Transactions on Industrial Informatics, 17(11), 7830-7838.
https://doi.org/10.1109/TII.2020.3005969

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov,
R. (2014). Dropout: A simple way to prevent neural networks from
overfitting. Journal of Machine Learning Research, 15(56), 1929-1958.
http://jmlr.org/papers/v15/srivastava14a.html

Van Eck, N. J., & Waltman, L. (2009). Software survey: VOSviewer, a com- 
puter program for bibliometric mapping. Scientometrics, 84(2), 523-538. 
https://doi.org/10.1007/S11192-009-0146-3 

Yao, S., Hu, S., Zhao, Y., Zhang, A., & Abdelzaher, T. (2017). DeepSense: 
A unified deep learning framework for time-series mobile sensing data 
processing. 26th International World Wide Web Conference, WWW 2017, 
351-360. https://doi.org/10.1145/3038912.3052577 


2 


Internet of Things Enabled Convolutional 
Neural Networks: Applications, Techniques, 
Challenges, and Prospects 


Sunday Adeola Ajagbe1,*, Matthew O. Adigun2, 
Joseph B. Awotunde3, John B. Oladosu4, 
and Yetunde J. Oguns5 


1,4Department of Computer Engineering, Ladoke Akintola University 
of Technology (LAUTECH), Nigeria 
1,2Department of Computer Science, University of Zululand, South Africa 
3Department of Computer Science, University of Ilorin, Nigeria 
5Department of Computer Studies, The Polytechnic Ibadan, Nigeria 
E-mail: saajagbe@pgschool.lautech.edu.ng; adigunm@unizulu.ac.za; 
awotunde.jb@unilorin.edu.ng; jboladosu@lautech.edu.ng; 
oguns.yetunde@polyibadan.edu.ng 

*Corresponding Author 


Abstract 


The Internet of Things (IoT) has proven useful for interconnecting 
computing devices embedded in everyday objects, enabling those objects to 
send and receive data through the Internet. It connects computing devices 
with mechanical and electronic equipment and with other objects, such as 
animals or humans, by means of unique identifiers that allow data 
transmission through a network without the intervention of a computer or 
human being. It helps people work smartly and gain control over activities; 
it is popular in home automation, and it is essential to business because it 
provides real-time monitoring of how businesses work. An IoT setup also 
provides supply chain and logistics operations with information on machine 
performance and ensures fast delivery of information. The importance of 
artificial intelligence (AI), deep learning, and machine learning for IoT 
technology cannot be underestimated, especially in making data collection 
easier and more dynamic. However, there is no categorical appraisal of IoT 
research that used convolutional neural network (CNN) techniques for its 
implementation, one that identifies the prominent fields where IoT and CNN 
are widely embraced in technological advancement and where IoT-enabled 
CNN technologies face challenges. Therefore, this chapter reviews the 
applicability of IoT-enabled CNN applications and techniques in various 
fields. It discusses the challenges and prospects of IoT-enabled CNN systems 
and the prominent fields of research that embrace them. It summarizes the 
research according to field of application and identifies areas of 
limitation. The chapter recommends more collaboration between researchers 
and the users and experts in the areas where such IoT applications are used, 
in order to overcome limitations, enhance IoT technologies, and provide 
insight into the security of future work in cyberspace; these include 
improvements in IoT-based mobility and scalability and the management of 
quality of service (QoS), among others. 


Keywords: Artificial intelligence, convolutional neural networks, deep 
learning (DL), IoT-applications, IoT-cybersecurity, machine learning (ML), 
industrial IoT. 


2.1 Introduction 


The Internet of Things (IoT) is a network of interconnected computing 
devices, mechanical and digital technologies, goods, animals, and people 
that use unique identifiers (UIDs) to enable data transfer without human or 
computer interaction. It offers a number of benefits for industries and 
professionals. It uses a huge number of sensors to generate massive amounts 
of data that may be used for a variety of applications and purposes. It also 
encompasses a variety of Internet-enabled smart devices, such as household 
appliances, sensors, phones, automobiles, and computer systems. It provides 
connectivity, interaction, and data interchange across a wide range of 
applications, from home automation and wearable technology to large-scale 
infrastructure development [1]. The IoT enables data integration and 
effective interchange between the individual in need and the service 
provider [2, 3]. It also has the potential to process data in the cloud 
environment to the user’s advantage. 

IoT has proven useful for interconnecting computing devices embedded in 
objects so that the objects can send and receive data through the Internet 
for day-to-day activities. The application of CNN has made the use of IoT 
more effective in the cloud computing field, as it helps keep the data 
transmitted by IoT devices free of errors and makes datasets easier to use. 
A person with an implanted cardiac monitor, a farm animal with an implanted 
biochip transponder, an automobile with built-in sensors that inform the 
driver of various issues, or any device that can be assigned an IP address 
and can send data over a network are all examples of things in the IoT. An 
IoT device connects to an IoT gateway or another edge device to share sensor 
data, which is either forwarded to the cloud or analyzed locally. Since 
useful decision-making relies on data, IoT may use artificial intelligence 
(AI) to make data collection processes much easier and more dynamic. 

Figure 2.1 depicts a typical IoT communication setup. The application 
domain interacts with services, which, in turn, interact with the data and 
the operator. The active satellite interacts with the data and the network 
domain, while the network domain, in turn, communicates with the dish and 
the IoT device. The IoT device, which uses the bridge through the gateway, 
also communicates with the user of the system. All interaction and 
communication in this setup take place over the Internet and are not 
possible without Internet access. The cloud environment provides the 
infrastructure to support all these devices and ensures data transfer and 
other related interactions. Researchers continue to improve on these 
innovations, making the framework easier to use and more efficient. 

Figure 2.1 IoT communication setup. [Diagram labels: Application Domain, 
Services, Data, Satellite, Network Domain, Dish, Gateway, Bridge, IoT 
Device, User.] 


During the COVID-19 pandemic, technical solutions were vital in keeping 
many cities operational, and the long-term effects of these technologies on 
metropolitan environments may remain even after COVID-19 has passed. Online 
food and pharmacy shopping is made easier with the help of robots, drones, 
and contactless payments, particularly for the elderly and the frail [4]. 
Manufacturers and enterprises can use IoT-enabled devices to control and 
automate their services, which helps remote workers in service delivery 
[5, 6]. Owing to the high-speed, low-cost Internet that 5G cells provide in 
urban areas, online health consultations, online learning, and cultural 
services are now possible. The epidemic has spread to about 216 countries 
throughout the globe, making it one of the deadliest diseases of the 
twenty-first century so far. Despite the availability of contemporary, 
advanced medical therapy, the disease continues to spread through 
outbreaks. In healthcare, both machine learning (ML) and deep learning (DL) 
research have made significant progress in a variety of areas (especially 
data and image analysis), including support for eventual medical diagnosis 
[7]. 

IoT applications have provided several solutions, and some services are now 
delivered remotely thanks to the introduction of IoT. Dian et al. [8] 
conducted a thorough review of the most recent and significant studies in 
the field of wearable IoT and cloud computing for data analysis and decision 
making. The study grouped wearables into four primary clusters: (1) medical, 
(2) sports and daily activities, (3) tracking and location, and (4) safety. 
It classified and examined the essential distinctions between the algorithms 
in each cluster, as well as the research difficulties and open concerns in 
each cluster, for further investigation. However, the tools that enable IoT 
to analyze and transfer data and to make decisions were not discussed. Also, 
the applications of IoT go beyond the four classes identified in that study 
[9]. The existing cloud computing approach is inefficient for analyzing 
massive amounts of data for AI or CNN applications in a short amount of time 
and meeting the needs of users. Therefore, there is a need to push the 
frontier of research on IoT-enabled CNN further and to study and analyze the 
applications, techniques, challenges, and prospects of the two technologies. 


2.1.1 Contribution of the Chapter 


In this chapter, we present the novelties of IoT-enabled systems based on 
CNN architectures and discuss their applications, techniques, challenges, 
and prospects (ATCP). The chapter also discusses the prominent fields where 
AI/CNN and IoT are embraced and where they face challenges, and it presents 
the applicability of CNN-enabled IoT techniques in various fields, with the 
aim of exploring the prospects of the research area and recommending areas 
for improvement. 


2.1.2 Chapter Organization 


The chapter is organized as follows. Section 2.1 introduces IoT- and 
CNN-enabled devices. Section 2.2 gives an overview of artificial 
intelligence (AI) applications and IoT in various fields. Section 2.3 
describes the CNN architecture, with seven distinct categories of CNN 
exploration. Section 2.4 covers the applicability of CNN techniques in the 
IoT environment and reveals the challenges of applying CNN-enabled IoT in 
various fields. Finally, Section 2.5 concludes the chapter and highlights 
directions for further study. 


2.2 Application of Artificial Intelligence in IoT 


The importance of CNN applications, as a form of artificial intelligence 
(AI), deep learning, or machine learning, for IoT technology cannot be 
underestimated, especially for easing the implementation of data-driven 
technologies and making data collection or acquisition possible. The 
comprehensive survey in [10] aimed to identify the applications of AI in 
combating the challenges posed by the COVID-19 outbreak; it therefore 
encompasses all AI methodologies used in this field to ensure a holistic 
survey. A related study set out to create an intelligent computing algorithm 
for forecasting COVID-19 outbreaks. The Facebook Prophet algorithm forecasts 
90-day future values, including the peak date of confirmed COVID-19 cases, 
for six of the world’s worst-affected countries and six of India’s 
high-incidence states. The model also reveals five important change points 
in the growth curve of confirmed Indian cases, indicating the impact of the 
government’s initiatives on the infection’s rate of spread. The model’s 
goodness-of-fit is 85% for the six countries and six Indian states [11]. 

Although DL and ML are branches of AI, each comprises its own networks for 
the successful implementation of projects. Alongside the CNN, DL networks 
include the recurrent neural network, generative adversarial network, 
auto-encoder, deep belief network, and restricted Boltzmann machine, and 
these have been applied to tasks such as human activity recognition. ANNs 
are examples of supervised and unsupervised networks in DL [12]. Each DL 
model has its merits and demerits, and these are evident in the performance 
evaluation of a proposed model. In the same vein, comparisons with 
conventional ML algorithms are used to delineate the superiority of a DL 
model over other models. This section serves as an impetus for advanced 
research in the field of AI-IoT applications for better performance, in 
addition to revealing the potential of AI in IoT applications. Table 2.1 
reviews AI-enabled IoT studies in different fields of application, 
describing each study together with its field and reference. Many novel 
studies were identified in the areas of health, natural language processing, 
technology, data mining, and security, among others. 


2.3 Convolutional Neural Networks and its Architecture 


A CNN is a type of neural network that has excelled in a number of 
competitions involving image processing and computer vision. Image 
classification and segmentation, natural language processing (NLP), video 
processing, object detection, and speech recognition are just a few of 
CNN’s application areas. The strong learning ability of CNN is attributed to 
its use of numerous feature extraction stages, which can automatically learn 
representations from data; these representations can be transmitted more 
easily when the architecture interfaces with the receiving end through an 
Internet-enabled (IoT) device. The availability of large datasets, as well 
as hardware developments, has spurred CNN research, and interesting CNN 
architectures have recently been reported [25, 26]. Different activation and 
loss functions, parameter optimization, regularization, and architectural 
design improvements are among the ways of improving CNNs that have been 
studied. Architectural advancements, in particular, have resulted in a 
massive rise in CNN’s representational capacity. The exploitation of spatial 
and channel information, architecture depth and width, and multi-path 
information processing have received a lot of attention in the taxonomy of 
these networks. The use of a layer block as a structural unit is also 
becoming more widespread [27, 28]. 


Table 2.1 Review of AI-enabled IoT studies in different fields of applications 

| AI-IoT studies and their descriptions | Field of application | Reference |
|---|---|---|
| AI-IoT survey study. | Health | [13] |
| AI-IoT was deployed in chatbot agents to support natural language processing. | Natural language processing | [14] |
| AI-IoT was employed for data mining and management to control congestion in the network. | Data mining | [15] |
| 4IR, or “Industry 4.0,” is a concept in which industrial gadgets and machines are connected to the Internet and interact to make choices using AI (M2M communication). | Artificial intelligence | [16] |
| Frameworks for centralized and distributed IoT-enabled AI technology were deployed. For various network designs, key technological challenges such as random access and spectrum sharing are examined. | Technology | [17] |
| For successful data management, a high-level scheme architecture was created and applied at the edge-level micro services, together with an AI technique. | Technology | [18] |
| The research proposes a hybrid AI/ML detection model as a solution for combating and mitigating IoT cyber risks in cloud computing settings, both at the host and network levels. | Technology | [19] |
| The work reviewed applications of novel technologies in AI and IoT and summed them up as AI-IoT. | Technology | [20] |
| The study created a CNN model to enable the IoT platform’s contextually aware services and tested the CNN model’s accuracy using a collection of photos acquired by the robot. The experimental results showed that the learning accuracy was over 98%, indicating that the study improved image context recognition learning. The paper’s contribution was the creation of image-and-context-aware learning and intelligence using a CNN model enhancement of the proposed IoT platform, as well as the realization of an IoT platform with an active CCTV robot. | Technology | [21] |
| The techniques and concepts of IoT and AI were explored, and the possibility of applying blockchain for providing security was also discussed. The study’s main focus was on the use of integrated technologies to improve data models, gain better insights, and discover new things, global verification, and innovative audit systems. Academics and industry professionals can exchange their thoughts and new research on the convergence of these technologies in this book, which is meant for both practitioners and scholars. Contributors provide their technical assessment and compare it to current technologies. There are also theoretical explanations and experimental case studies relating to real-time scenarios. IT workers, researchers, and academics working on fourth-generation technology will benefit from this study. | Data modeling and intelligence prediction | [22] |
| To combat the COVID-19 pandemic, three research directions were proposed using AI-based methods: classification of chest X-ray images with deep convolutional neural networks (DCNN) and transfer learning to improve diagnosis; patient risk prediction based on patient features, comorbidities, first symptoms, and vital signs for disease prognosis; and forecasting of disease transmission and case fatality rates with deep neural networks. Additionally, some of the issues of open datasets and research opportunities were discussed. | Health | [7] |
| The number of AI/ML algorithms at the edge is growing in tandem with the number of low-cost sensors that allow IoT remote sensing. For establishing a distributed remote sensing network and interfacing with the cloud, Raspberry Pi and other boards with appropriate sensor devices are readily available. The paper describes the components and preliminary findings of establishing a small distributed remote sensing network for recognizing and identifying a range of target kinds using NVIDIA Jetson Nano edge devices with low-cost acoustic and image sensors, as well as IoT Greengrass. Additionally, cloud services were used to enable auditing and monitoring, resulting in a secure and dependable operational service environment. | Security | [23] |
| A comprehensive analysis of IoT security requirements and difficulties was presented, as well as a discussion of the unique role of DL and a review of state-of-the-art research work in IoT contexts employing DL methodologies. A comparative examination of DL algorithms such as RNN, LSTM, DBN, and AE was also performed. | Technology | [24] |
| Based on the notions of applications of CNN, the TCNN, an intelligent model for IoT-IDS that combines CNN with general convolution, was proposed. The TCNN is built using synthetic minority oversampling methods with nominal continuity to accommodate unbalanced datasets in the model. It was then used with useful feature engineering approaches such as attribute transformation and reduction. Using the Bot-IoT data repository, the model is compared with two classical ML methods (random forest and logistic regression) and with LSTM and CNN architectures. | Security | [25] |

CNNs are biologically inspired structures that are at the heart of deep 
learning algorithms for computer vision [29]. A CNN is a feed-forward neural 
network that uses many convolutional layers to extract features and fully 
connected layers with softmax to make inferences [30]. As the phrase 
“convolutional neural network” indicates, the system uses a linear 
mathematical operation called convolution instead of ordinary matrix 
multiplication in at least one of its layers. Like a normal multi-layer 
neural network, a CNN contains at least one convolutional layer and at least 
one fully connected layer. CNNs are a common strategy in image recognition 
and computer vision frameworks, and they are significant in the field of 
data science. The number of nodes in the input layer of a CNN is determined 
by the size of the input data, such as the resolution of an image. The 
convolutional layers are named after the convolution computation. They are 
also thought to be the key to extracting the most distinguishing features 
from the original dataset. Convolutional kernels are used to build feature 
maps, which are then activated by nonlinear activation functions such as the 
sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU) 
functions [31]. 
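
As a minimal sketch of how a convolutional kernel builds a feature map that is then passed through a nonlinear activation, the following NumPy example may help. The toy image, the 2x2 kernel, and the function names are illustrative assumptions and do not come from any of the cited works.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and return the feature map.

    Like most deep learning frameworks, this computes a cross-correlation,
    i.e. the kernel is not flipped before sliding.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with one image patch, then sum.
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Toy 4x4 "image": dark left half, bright right half (illustrative values).
image = np.array([[0.0, 0.0, 1.0, 1.0]] * 4)
# Hypothetical 2x2 kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

feature_map = conv2d_valid(image, kernel)   # shape (3, 3)
activations = {
    "relu": relu(feature_map),
    "sigmoid": sigmoid(feature_map),
    "tanh": np.tanh(feature_map),
}
```

In this toy case the feature map is non-zero only in the column where the window straddles the edge, which is the sense in which a kernel "extracts" a feature.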

CNNs exhibit remarkable performance among DL networks, especially with huge 
volumes of data, since they have more layers that are deeper and more 
interconnected. The CNN receives a two-dimensional picture or a voice signal 
as input. With the help of a chain of hidden layers, including convolutional 
layers and pooling layers, the CNN extracts hierarchical properties from the 
input data. CNNs can aid in the creation of smarter IoT-based solutions by 
deriving valuable insights from the vast volume of data that IoT generates. 
Exploring the possibilities of CNN for IoT data analytics therefore becomes 
extremely important. The CNN approach emerged from AI and its various 
architectures [12]. Linear regression, a machine learning technique, has 
been used to estimate the numbers of active COVID-19 cases, deaths, and 
recoveries. The extension of the lockdown across India was to be based on 
the expected numbers of active cases, deaths, and recoveries. A date-wise 
analysis of current data was performed to predict these numbers, and 
necessary parameters such as daily recoveries, daily deaths, and the rate of 
increase of COVID-19 cases were included. To make the expected findings more 
understandable, each analysis and forecast was represented graphically [32]. 

CNN is the most prominent DL network model. A CNN has three types of layers: 
(1) convolutional, (2) pooling, and (3) fully connected. The convolutional 
layer is made up of a finite number of filters that are convolved with the 
input data to identify a large number of important features in the input 
image. The pooling layer uses a down-sampling approach to reduce the number 
of resulting features, reducing the overall computational effort. The 
network goes deeper by repeating the convolution-pooling sequence numerous 
times, depending on the data and the required accuracy. In this way it 
collects higher-level features from the input dataset, and one or more fully 
connected layers follow for classification [33]. Many C++, MATLAB, and 
Python-based frameworks are available for executing various tasks on the CNN 
architecture [34]. CNNs are also among the most frequently used strategies 
in image recognition, object identification, and computer vision frameworks, 
and they are undeniably significant in data science. They employ deep 
learning algorithms that take video or image input, assign weights and 
biases to various parts of the image, differentiate those parts from one 
another, and transform the data into a numerical dataset. They try to make 
use of the spatial information contained within an image’s pixels and 
therefore rely on discrete convolution [35]. 
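
The pooling and fully connected stages described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 4x4 feature map, the random weights, and the three output classes are hypothetical, and no training is performed.

```python
import numpy as np


def max_pool2d(feature_map, size=2):
    """Down-sample by taking the maximum of each non-overlapping size x size window."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    windows = trimmed.reshape(h_out, size, w_out, size)
    return windows.max(axis=(1, 3))


def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shifted = scores - scores.max()   # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Illustrative 4x4 feature map, e.g. the output of a convolutional layer.
feature_map = np.array([[1.0, 3.0, 2.0, 0.0],
                        [4.0, 2.0, 1.0, 5.0],
                        [0.0, 1.0, 7.0, 2.0],
                        [3.0, 2.0, 1.0, 6.0]])

pooled = max_pool2d(feature_map)      # 4x4 -> 2x2, keeps the strongest responses

# Fully connected layer with hypothetical random (untrained) weights, 3 classes.
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, pooled.size))
bias = np.zeros(3)
probabilities = softmax(weights @ pooled.ravel() + bias)
```

Pooling here reduces sixteen values to four, which is the reduction in computational effort the text refers to; a real network would repeat the convolution-pooling pair several times before the fully connected layers.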

The introduction of new generations of networks has posed substantial 
challenges in meeting the needs of many new applications, the majority of 
which require advanced infrastructure to offer the resources necessary for a 
high quality of service. Among biologically inspired artificial intelligence 
(AI) techniques, CNNs are the most extensively used algorithms. CNN’s 
beginnings can be traced back to neurological experiments [25, 36]. The 
survey in [25] focuses on the intrinsic taxonomy found in recently reported 
CNN architectures and divides current CNN architecture improvements into 
seven groups: spatial exploitation, depth, multi-path, width, feature-map 
exploitation, channel exploitation, and attention. The survey also offers a 
basic grasp of CNN components, current problems, and CNN applications; its 
taxonomy is presented in Figure 2.2. In general, all these groups influence 
the architectural performance of the networks. 


Figure 2.2 CNN architectural innovation. [Diagram labels: CNN Architectural 
Innovation, CNN Taxonomy.] 


2.3.1 CNN Based on Spatial Exploration 


CNNs have many parameters and hyper-parameters that must be set for 
effective implementation, including the number of layers, the weights and 
biases, the number of processing units, the stride, the activation function, 
and the learning rate [25]. Because convolutional operations analyze the 
neighborhood of input pixels, different levels of correlation can be 
examined by using different filter sizes. Filters of various sizes capture 
various levels of granularity; typically, small filters extract fine-grained 
information, while large filters extract coarse-grained information. Spatial 
exploration of CNNs brought about LeNet, VGG, GoogLeNet, AlexNet, and ZFNet, 
which were its early products [37, 38]. This exploration does not consider 
IoT. 


2.3.2 Depth of CNNs 


The depth of a CNN architecture refers to the number of stacked 
convolutional layers in the network. Deep architectures are based on the 
premise that, as the network’s depth increases and more nonlinear mappings 
are added, the network can better approximate the target function [39]. The 
success of supervised training has been attributed to network depth. 
According to theoretical studies, deeper networks can express some classes 
of functions more effectively than shallow designs. Depth-based 
architectures include Highway Networks, ResNet, Inception-ResNet, and 
Inception-V3 and V4 [40, 41]. 
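
The interplay between filter size (Section 2.3.1) and depth (Section 2.3.2) can be illustrated with some simple arithmetic. The sketch below uses the standard output-size formula for a convolution and compares one large filter with a stack of small ones; the input size of 32 and the 64 channels are assumed values chosen only for illustration.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial size of the output for an n x n input and a k x k filter."""
    return (n + 2 * padding - k) // stride + 1


def conv_weight_count(k, in_channels, out_channels):
    """Number of kernel weights in one convolutional layer (biases omitted)."""
    return k * k * in_channels * out_channels


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions, measured in input pixels."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


# Small filters look at a small neighborhood, large filters at a larger one.
size_after_3x3 = conv_output_size(32, 3)             # 30
size_after_5x5 = conv_output_size(32, 5)             # 28

# Two stacked 3x3 layers cover the same 5x5 neighborhood as one 5x5 layer,
# with fewer weights and one extra nonlinearity in between.
field_two_3x3 = stacked_receptive_field([3, 3])      # 5
weights_one_5x5 = conv_weight_count(5, 64, 64)       # 102400
weights_two_3x3 = 2 * conv_weight_count(3, 64, 64)   # 73728
```

This is one way to see why deeper stacks of small filters became common: they reach the same spatial coverage as a large filter at lower parameter cost.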


2.3.3 Multi-Path of CNNs 


Training deep networks is difficult, and it has been the focus of 
contemporary deep network research. In general, CNNs excel at complicated 
tasks. However, they may suffer from performance degradation and from 
vanishing or exploding gradients, which are caused by the increase in depth 
rather than by overfitting [42, 43]. The vanishing gradient problem leads to 
a higher test error as well as a higher training error [44]. The concept of 
cross-layer connectivity was developed to ease deep network training. 
Multiple pathways can connect one layer to another while skipping some 
intermediate layers, allowing specialized information to flow across the 
layers. The Highway Network, ResNet, and DenseNet are the major multi-path 
techniques [45-47]. 
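
The three kinds of cross-layer connection named above differ mainly in how the skipped input is recombined with the transformed signal. The following NumPy sketch shows the general idea with plain matrix multiplications standing in for convolutional layers; the function names, weight shapes, and the simplification to vectors are assumptions made for brevity, not the published architectures in full.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def residual_block(x, w1, w2):
    """ResNet-style shortcut: the input is added back to the transformed signal."""
    return relu(w2 @ relu(w1 @ x) + x)


def highway_layer(x, w_h, w_t, b_t):
    """Highway-style gate: t decides how much is transformed and how much is carried."""
    h = relu(w_h @ x)               # transformed signal
    t = sigmoid(w_t @ x + b_t)      # transform gate, values in (0, 1)
    return t * h + (1.0 - t) * x


def dense_connection(x, w):
    """DenseNet-style connection: new features are concatenated with the input."""
    return np.concatenate([x, relu(w @ x)])


x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))

# With all-zero weights the residual block still passes its input through,
# which is why skip connections ease the training of very deep stacks.
passed_through = residual_block(x, zeros, zeros)
# A strongly negative gate bias makes the highway layer carry its input unchanged.
carried = highway_layer(x, np.eye(3), zeros, np.full(3, -50.0))
grown = dense_connection(x, np.ones((2, 3)))   # 3 input features + 2 new ones
```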


2.3.4 Width for Multi-Connection CNNs 


The focus of [48] was primarily on leveraging the potential of depth, as 
well as the efficacy of multi-pass regulatory links in network 
regularization. According to [49], however, the network’s width is equally 
crucial. The multi-layer perceptron gained the advantage of mapping complex 
functions over the perceptron by using numerous processing units in parallel 
within a layer. This shows that, like depth, width is an important factor in 
developing learning principles. The authors in [50] demonstrated that NNs 
with ReLU activation functions must be wide enough to retain the universal 
approximation property as depth increases. PyramidalNet, Wide ResNet, 
ResNet, Inception, and Xception all belong to the family of width-based CNN 
architectures [51]. 


2.3.5 Feature-Map Exploitation for CNN 


Hierarchical learning and automatic feature extraction make the CNN 
architecture useful for machine vision (MV) tasks. The performance of 
classification and segmentation modules is heavily influenced by feature 
selection. The weights associated with a kernel are called a mask, and in a 
CNN they are tuned so that features are chosen dynamically. In addition, 
feature extraction is performed in several stages, allowing a wide range of 
features to be extracted. Some of the feature maps, on the other hand, play 
little or no role in object discrimination [52, 53]. Large feature sets may 
produce a noise effect, resulting in overfitting of the network, which in 
most cases motivates feature-map selection techniques such as squeeze and 
excitation. Conversely, a small feature set may result in underfitting of 
the network. 
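
One common reading of the squeeze and excitation idea mentioned above is a small gating module that reweights feature maps channel by channel, so that less useful maps are suppressed. The sketch below follows that reading in NumPy; the channel count, the reduction to two hidden units, and the random untrained weights are assumptions for illustration only.

```python
import numpy as np


def squeeze_excite(feature_maps, w_reduce, w_expand):
    """Reweight each channel of a (channels, height, width) tensor.

    squeeze: average every feature map down to a single number;
    excite:  pass those numbers through a small two-layer gate that
             outputs one weight in (0, 1) per channel.
    """
    squeezed = feature_maps.mean(axis=(1, 2))               # shape (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)           # ReLU, shape (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))    # sigmoid, shape (C,)
    return feature_maps * weights[:, None, None], weights


rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(4, 3, 3))   # 4 hypothetical feature maps of size 3x3
w_reduce = rng.normal(size=(2, 4))          # untrained weights, reduction ratio r = 2
w_expand = rng.normal(size=(4, 2))

recalibrated, channel_weights = squeeze_excite(feature_maps, w_reduce, w_expand)
```

In a trained network the gate weights are learned, so channels that contribute little to discrimination receive weights close to zero.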


2.3.6 The CNN Channels for Exploitation 


The performance of image processing techniques, in both conventional and DL 
systems, is heavily influenced by image representation. A good image 
representation is one that can define the key elements of an image with a 
compact code. In machine vision tasks, several conventional filters are used 
to extract different levels of information from the same type of image. The 
model then uses these various representations as input to improve 
performance [54]. A CNN is a compelling feature learner that can 
automatically extract discriminating features depending on the problem [55]. 
Channel exploration for CNN produced the channel-exploited CNN using 
transfer learning (TL). In 2018, the authors of [56] developed a novel CNN 
design called channel boosted CNN, based on the idea of increasing the 
number of input channels to improve the network’s representational capacity. 
Channel boosting is accomplished by using auxiliary deep generative models 
to artificially create extra channels, which are subsequently exploited by 
deep discriminative models. 
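
At the level of data shapes, channel boosting amounts to stacking extra, artificially generated channels onto the original input before it reaches the discriminative network. The sketch below shows only that stacking step. The two auxiliary functions are deliberately trivial, hypothetical stand-ins for the auxiliary deep generative models of [56], which are not reproduced here.

```python
import numpy as np


def grayscale_channel(image):
    """Hypothetical stand-in for an auxiliary model: one averaged channel."""
    return image.mean(axis=0, keepdims=True)


def inverted_channels(image):
    """Second hypothetical stand-in: an inverted copy of every channel."""
    return 1.0 - image


def boost_channels(image, auxiliary_models):
    """Concatenate the original channels with those produced by each auxiliary model."""
    extra = [model(image) for model in auxiliary_models]
    return np.concatenate([image] + extra, axis=0)


rng = np.random.default_rng(0)
image = rng.random((3, 8, 8))    # illustrative 3-channel 8x8 input in [0, 1)

boosted = boost_channels(image, [grayscale_channel, inverted_channels])
# The network that follows would now receive 3 + 1 + 3 = 7 input channels.
```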


2.3.7 Attention Exploration for CNN 


Different levels of abstraction play a key role in determining a neural
network’s (NN’s) discrimination power. In addition to learning numerous
hierarchies of abstractions, concentrating on context-relevant features is
important for image identification and localization [25], [57]. This
selective focus on context-relevant features is known as attention in the
human visual system [58]. Because convolutional operations allow for weight
sharing, distinct sets of features inside an image can be retrieved by
sliding kernels with the same set of weights over the picture, making CNNs
more parameter-efficient. Convolution operations can be classified based on
the type and size of filters, the type of padding, and the direction of
convolution. The overall CNN operation can be executed in three steps: data
collection and feature generation, preprocessing and feature selection, and
data partitioning, learning, and analysis/evaluation [59, 60].
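
Weight sharing can be made concrete with a short sketch in which a single
3 x 3 kernel is slid over an image. The function name, the example filter,
and the image are illustrative assumptions; as in most CNN libraries, the
operation shown is cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide one kernel (one shared set of weights) over a 2D image."""
    if padding:
        image = np.pad(image, padding, mode="constant")
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every position
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)      # simple horizontal-gradient filter
print(conv2d(image, kernel).shape)             # (3, 3): no padding
print(conv2d(image, kernel, padding=1).shape)  # (5, 5): zero padding of 1
print(kernel.size)                             # 9 weights, whatever the image size
```

The filter size, the padding, and the stride are the parameters by which the
text above classifies convolution operations; the number of weights stays at
nine however large the image is.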


Figure 2.3 shows the operational layout of a typical CNN system and the
activities carried out in each step. The use of the CNN architecture is
divided into three distinct steps, as follows:


(i) Data collection and feature generation: All AI experiments are about
data, so the first task is to acquire data, and all acquired data has
features. Therefore, this is the first step in the CNN operational layout.

(ii) Data preprocessing and feature selection: The operations carried out
here include missing data imputation, feature analysis, data normalization,
and the selection of the features relevant to the project or experiment at
hand.

(iii) Data partitioning, learning, and analysis: In some studies, this step
may be broken into two, depending on the implementation and the data
involved: data partitioning and learning would be one step, and analysis or
evaluation would be the second. The data is partitioned into training,
validation, and testing datasets; for better CNN operation and to avoid
overfitting and underfitting of the models, attention is paid to both the
dataset and the model. The analysis or evaluation is based on the accuracy
of the model(s), recall, and other metrics, depending on the particular
operation. The metrics for data classification differ from those for data
segmentation.


Figure 2.3 Typical CNN system operational layout steps.
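
The partitioning and evaluation step can be sketched as follows. The 70/15/15
split ratio, the function names, and the toy labels are assumptions chosen
for illustration; they are not taken from the studies cited above.

```python
import numpy as np

def partition(n_samples, train=0.7, val=0.15, seed=0):
    """Shuffle indices and split them into training, validation, and test sets."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return float(np.mean(y_true == y_pred))

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that the model detected."""
    positives = y_true == positive
    return float(np.sum(positives & (y_pred == positive)) / np.sum(positives))

train_idx, val_idx, test_idx = partition(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(accuracy(y_true, y_pred))  # 0.75
print(recall(y_true, y_pred))    # 0.75
```

For a segmentation task, per-pixel or overlap-based metrics would replace
these classification metrics.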


2.4 The CNN Techniques in IoT Environment


The Internet of Things (IoT) is a network of smart objects like sensors, home
appliances, phones, automobiles, and computers that are connected via the
Internet [61]. Although the IoT is not a new concept, it is currently a major
topic around the world. IoT refers to the interconnection of electronic
devices to allow data to be exchanged between them for specific domain
applications. This internetworking concept makes human life much easier than
it was previously. A TCNN, which combines CNN with causal convolution, was
devised and applied in the construction of an effective and efficient
DL-based intrusion detection system for the IoT [62]. CNNs are also one of
the most often used strategies in image recognition, object identification,
and computer vision frameworks, and they are undeniably significant in data
science. CNN approaches employ deep learning algorithms, which take
video/image input, assign weights and biases to various parts of the image,
and then distinguish them from one another [63]. They try to make use of the
spatial information contained within an image’s pixels and, as a result, rely
on discrete convolution [64].
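
Causal convolution, mentioned above in connection with the TCNN, restricts
each output to the current and earlier inputs. The following is a minimal
one-dimensional sketch; the function name and the two-tap filter are
assumptions, and the code does not reproduce the architecture in [62].

```python
import numpy as np

def causal_conv1d(x, weights):
    """1D causal convolution: y[t] = sum_k weights[k] * x[t - k].

    Left-padding with zeros ensures that y[t] never depends on x[t + 1] or
    any later sample.
    """
    k = len(weights)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([
        np.dot(weights, padded[t:t + k][::-1]) for t in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])    # average of the current and the previous sample
print(causal_conv1d(x, w))  # [0.5 1.5 2.5 3.5]
```

Changing the last sample of `x` alters only the last output, which is the
property that makes the operation suitable for streaming data such as
network traffic.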

The study in [65] investigated the question: “Can an existing optimized CNN
model be used to automatically generate a competitive CNN for an IoT
application whose objects of interest are a fraction of the categories that
the original CNN was meant to classify, resulting in a proportionally
scaled-down resource requirement?” The notion of, and proposed method for,
the automatic synthesis of resource-scalable CNNs from an existing optimized
baseline CNN was referred to as resource scalability. The results show that
the synthesized CNN has the learning power to handle the given IoT
application requirements while also providing competitive accuracy. The
suggested method is quick, and unlike current CNN design practice, it does
not necessitate iterative rounds of training trial and error [65]. With no
prior knowledge, the authors in [66] proposed a mobile-based methodology for
counting people in high- and low-density public locations in Saudi Arabia
under a variety of scene conditions. The deeper convolution neural network
(DCNN) they introduced is based on a pre-trained CNN called VGG-16, with some
modifications to the last layer of the CNN to improve the training model’s
efficiency [67]. The emphasis is on crowd counts as well as on producing
high-quality density maps. The proposed method improves efficiency while
accepting photos of various sizes and scales as inputs. The model’s backbone
is made up of pure convolutional layers, which allow for the use of images of
various sizes and resolutions. The mobile counting application attempts to
shorten users’ wait times by displaying the crowd size of the place they are
visiting.

Kimbugwe et al. [68] give an in-depth look at how DL algorithms have been
used to improve QoS in the IoT. According to the study, QoS in IoT-based
systems is disrupted when the systems’ security and IoT resources are not
adequately managed. The goal of the study is therefore to learn how DL has
been used to improve QoS in IoT by preventing security and privacy breaches
in IoT-based environments and by assuring proper and efficient resource
allocation and management. It selected the most commonly used DL algorithms
for dealing with IoT QoS concerns and described the state of the art of those
techniques. For resource-constrained contexts, Lawrence and Zhang [69]
developed IoTNet, a CNN-IoT enabled technique that achieves state-of-the-art
performance within the domain of tiny efficient models. Instead of using
depth-wise convolutions, IoTNet trades accuracy against computational cost in
a different way than prior approaches, by factoring standard 3 x 3
convolutions into pairs of 1 x 3 and 3 x 1 standard convolutions. The study
compares IoTNet against state-of-the-art efficiency-focused models and
scaled-down large architectures on datasets that are most representative of
the complexity of challenges encountered in resource-constrained contexts.
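
The saving obtained by factoring a 3 x 3 convolution into a 1 x 3 and 3 x 1
pair can be checked with simple arithmetic. The sketch below counts weights
only (bias terms ignored) and assumes that the intermediate feature map keeps
the output channel count; the channel count of 64 is illustrative and is not
taken from [69]. Note that the factored pair is cheaper but cannot represent
every 3 x 3 filter, which is the accuracy side of the trade-off.

```python
def conv_params(kernel_h, kernel_w, c_in, c_out):
    """Number of weights in a standard convolution layer (biases ignored)."""
    return kernel_h * kernel_w * c_in * c_out

def factored_pair_params(c_in, c_out):
    """Weights in a 1 x 3 convolution followed by a 3 x 1 convolution.

    Assumption: the intermediate feature map has c_out channels.
    """
    return conv_params(1, 3, c_in, c_out) + conv_params(3, 1, c_out, c_out)

c = 64  # illustrative channel count
standard = conv_params(3, 3, c, c)
factored = factored_pair_params(c, c)
print(standard, factored, round(factored / standard, 3))  # 36864 24576 0.667
```

With equal input and output channel counts, the pair needs 6 weights per
channel pair instead of 9, a reduction of one third.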

CNN techniques perform well in several areas, especially in data analysis
and decision making. Many CNN techniques have been applied to solve problems
in IoT environments, and their remarkable achievements are evident [70].
Contributions in the areas of healthcare, agriculture, meteorology,
intelligent learning, smart cities, biometrics, and e-commerce are discussed
below, together with some of the novel works in each area. Figure 2.4 shows
typical applications of CNN and IoT techniques in various fields.


Figure 2.4 Applications of CNN and IoT techniques in various fields.


2.4.1 Intelligence Healthcare System 


The healthcare system has become an intelligent system through the evolution
of IoT-based applications. IoT-based technology has made diagnosis,
management, testing, data access, and contact tracing (especially during a
pandemic), among other healthcare tasks, easier for both healthcare
personnel and patients [71]. IoT is converting traditional healthcare systems
into more personalized ones, making it easier to diagnose, treat, and monitor
patients. The current global pandemic caused by the second variant of the
novel, highly contagious respiratory illness COVID-19 is the most serious
global public health disaster since the 1918 influenza pandemic [72]. Since
the outbreak of COVID-19, researchers have been working feverishly to
leverage a wide range of technologies to battle the global threat, and IoT
technology is one of the forerunners in this field. For COVID-19 cases,
IoT-enabled devices and applications are used to reduce the risk of COVID-19
spreading to others by detecting the virus early, monitoring patients, and
following up on the prescribed protocols once the patients have recovered
[73]. The study reported a setback in the area of mobility.

With a better knowledge of the nature of the COVID-19 pandemic, the authors
in [74] proposed an architecture for diagnosing probable COVID-19 cases in
real time; most importantly, the architecture could be used to monitor and
forecast therapy for confirmed patients. The suggested system is divided into
five layers: data collection, isolation, preprocessing of the dataset, and
analysis layers. In addition, the health physician application layer and
cloud infrastructure were used to connect the layers. The system uses smart
technologies to collect biological data from patients and sends it to a
cloud-IoT server for analysis and processing. As a result, any abnormalities
in patient data are reported to the patient’s physicians via the COVID-19
monitoring device and alert platform. Of the four algorithms used, LGBM
outperformed the others with 97% accuracy, followed by XGBoost with 90%
accuracy and random forest with 76% accuracy. In the confusion matrix, the
LGBM model had a recall rate of 96%, indicating a good detection rate; the
study is similar to [75]. Kumar et al. [22] proposed a 24-hour intelligent,
non-invasive blood glucose monitoring system intended to address the
challenges of invasive blood monitoring. The system provides accurate
readings and generates alerts using IoT so that undesirable effects caused by
severe variations in blood glucose levels can be avoided. It makes decisions
with an ML model that has been trained with data. The system is implemented
on an FPGA and achieves optimum efficiency and throughput while consuming
little energy [76].


2.4.2 Intelligence Learning System 


The educational application should be more beneficial to art students and 
designers than to children. The understanding of object position and three- 
dimensional state is increasingly vital and crucial in painting and design. The 
primary and secondary relations of items can be found by separating several 
objects and finding the relation, especially in the field of picture segmentation. 
The location relationship of objects can be readily understood after removing 
the complex background. The majority of the assistance provided to children 
and students will be in the area of thinking construction. It can effectively 
assist students in forming a three-dimensional perspective in their thinking, 
such as in three-dimensional geometry questions in the college entrance exam 
[77, 78]. A knowledge-level assessment for e-learning systems using ML and
user activity analysis was proposed in [79]. The project aimed at designing a
futuristic, intelligent, and autonomous e-learning system in which ML and
user activity analysis serve as an automatic knowledge-level evaluator. To
adapt the content presentation and obtain a more realistic evaluation of
e-learning, it is necessary to measure the learners’ knowledge level. Several
classification methods are used to predict the learners’ knowledge level,
and the results are reported. In addition, the study provides a modern design
for a dynamic learning environment that follows current e-learning trends.
The experimental results show that when evaluating knowledge levels, an SVM
model outperforms other algorithms with 98.6% of examples correctly
classified and an MAR of 0.0069 [79].


2.4.3 Smart City 


An intelligence-based city is referred to as a smart city; this involves, but
is not limited to, smart transportation systems, intelligent vehicles, and
traffic and highway management. Because a smart city is an intelligent
system, AI techniques play a pivotal role in its development, and its
interoperability rests on IoT, with the cloud providing the connection. The
use of IoT also provides supply chain and logistics operations with
information on machine performance. Payvar et al. [80] worked on vehicle
classification; the study described an NN-based image classifier that has
been trained to categorize vehicle photos into four categories. Binarization
is used to optimize the neural network, and the resulting binarized network
is placed on an IoT-class processor for execution. The binarization reduces
the CNN’s memory footprint by about 95% while increasing performance by more
than 6%. In addition, the paper showed that by employing the processor’s
proprietary “popcount” instruction, the performance of the binarized vehicle
classifier can be improved by more than two times, making the CNN-based image
classifier appropriate for even the smallest embedded processors [80]. The
work recommended improvement in the area of scalability.
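
Why a population-count instruction helps a binarized network can be shown
with a small sketch. When weights and activations are restricted to +1 and
-1 and packed one bit each, a dot product reduces to an XNOR followed by a
count of set bits. The packing convention and function names below are
assumptions for illustration and do not describe the implementation in [80].

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int, one bit each (+1 -> 1)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # bit set where the two signs match
    matches = bin(agree).count("1")    # population count
    return 2 * matches - n             # matches minus mismatches

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
print(binary_dot(pack_bits(a), pack_bits(b), len(a)))  # 2
print(sum(x * y for x, y in zip(a, b)))                # 2, same result
```

Storing one bit per weight instead of a multi-byte floating-point value is
also the source of the memory saving that binarization brings.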

With the end objective of on-road vehicle detection, [73] proposes a
preprocessed faster region-based convolutional neural network (faster RCNN).
The framework adds a preprocessing pipeline to faster RCNN in order to
improve its training and detection time. To detect lanes, a preprocessing
lane-detection pipeline based on the Sobel edge operator and the Hough
transform is used. A rectangular region, the reduced region of interest
(ROI), is then extracted. When compared to faster RCNN without
preprocessing, the proposed technique improves the training speed of faster
RCNN [73]. The system will be helpful in transportation and security,
especially in urban centers.
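
The Sobel edge step named in the preprocessing pipeline above can be sketched
on a synthetic image. This is a generic illustration of the Sobel operator
only: the Hough transform and the region extraction are omitted, and the
helper names and the test image are assumptions.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Apply a 3 x 3 filter without padding (output shrinks by 2 per axis)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * image[dy:dy + h - 2, dx:dx + w - 2]
    return out

def sobel_magnitude(image):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy)

image = np.zeros((6, 6))
image[:, 3:] = 1.0  # dark left half, bright right half: one vertical edge
edges = sobel_magnitude(image)
print(edges[0])  # [0. 4. 4. 0.]
```

The response is non-zero only where the window straddles the intensity step,
which is what a subsequent line-detection stage would operate on.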


In the case of a pandemic like COVID-19 in metropolises and cities, [81] 
proposes an end-to-end IoT infrastructure to promote social distancing. The 
design includes the most common IoT use cases in relation to COVID-19. 
Using the IoT architecture, the novelty of the work was the presentation of 
short- and long-term strategies for managing the social distancing method, 
although none of the CNN techniques was used. 


2.4.4 Agriculture 


Gikunda and Jouandeau [82] propose a classification taxonomy for CNN
applications in agriculture and provide a complete assessment of research on
the use of cutting-edge CNNs in agricultural production systems. The study
makes a two-fold contribution. First, the benchmarking findings can help
end-users of agricultural DL applications choose the right architecture to
use. Second, the in-depth research describes the complexity of
state-of-the-art CNNs and points out probable future paths for developers of
agricultural DL tools to better optimize operating performance. The authors
investigated agricultural challenges that use the primary state-of-the-art
CNN architectures that have achieved the highest accuracy on the multi-class
classification problem of the ImageNet large-scale visual recognition
challenge. Their analysis sheds light on the use of cutting-edge CNN models
for smart farming and identifies computer-vision-related smart farming
research and development issues. The study concluded by identifying
real-time image classification and interactive image detection and
classification as the areas for improvement in agricultural domains, despite
the remarkable progress in the use of cutting-edge CNNs in agriculture in
general.

The article demonstrated the value of cutting-edge CNN in IoT-based 
smart farming. A survey was also conducted on the use of the identified CNNs 
in agricultural services. The results show that in all of the agricultural sector 
scenarios studied, state-of-the-art CNN provided greater precision. When 
compared to other image-processing algorithms, it achieves superior accuracy 
in the majority of the problems for agricultural service. Despite outstanding 
performance in using state-of-the-art CNN in agriculture in general, future 
researchers may explore gray areas in relation to smart farms, such as real- 
time image classification, interactive object identification, and classification. 
The mobility and scalability problems were reported in real-time applications
of these technologies.


2.4.5 Meteorology


With no prior knowledge, a mobile-based approach is suggested for counting 
people in low- and high-packed public locations in Saudi Arabia under 
diverse scene conditions. The suggested model is based on the VGG-16 
pre-trained CNN model, with some tweaks to the last layer of the CNN 
to improve the training model’s efficiency. The suggested method supports 
photos of arbitrary sizes/scales as inputs, in addition to improving efficiency. 
The applicability of the suggested method was assessed by implementing an IoT
architecture, which requires surveillance cameras to be connected to the
Internet in order to gather live images of various public locations [83]. The
most prominent challenge in relation to IoT is computational cost: because of
the high computational cost of these topologies, big CNNs are often
impractically slow, especially on embedded IoT devices. Early detection,
isolation of the sick person, and identification of probable contacts are
important when a pandemic first breaks out in cities. IoT protocols,
particularly Bluetooth low energy, as well as RFID, NFC, and Wi-Fi, are
gaining a lot of traction as answers to these problems [84]. Integrated,
automated, self-monitoring IoT-based systems with improved communication are
increasing, and they help in the production of smart solutions (such as smart
city, smart farming, technological process, meteorology, eHealth, and
biometrics) that can be used to analyze problems in different fields and
proffer solutions. Figure 2.2 depicts the applications of IoT devices. The
study reported a setback in the area of mobility.


2.4.6 Biometrics Applications 


Rattani et al. [85] created a CNN architecture for combining biometric data
from several sources. CNN-based multibiometric fusion has the advantages of
(1) being able to execute both early and late fusion, and (2) the fusion
architecture itself being learnable during network training. Experiments on
large-scale VISOB data show that multibiometric CNNs outperform the
traditional fusion method. Based on CNN techniques, [86] created a
localization framework that moves the online prediction complexity to an
offline preprocessing step (CNN). The indoor localization problem is
described as 3D radio-image-based region recognition, inspired by the
exceptional performance of such networks in the image classification field.
Its goal is to precisely locate a sensor node by determining its position
region. The received signal strength indicator fingerprints are used to
create 3D radio images. The simulation findings validate the parameters,
optimization strategies, and model designs that were employed [87]. Based on
a multibiometric system with fingerprint, finger-vein, and facial
recognition, [88] suggested a hybrid system incorporating the effect of three
efficient models: CNN, random forest classifier, and softmax. In the
traditional fingerprint system, the foreground and background regions are
segregated using the K-means and DBSCAN algorithms and image preprocessing is
done; the features are then extracted using a CNN with a dropout technique,
and softmax is used as the identifier. In the traditional finger-vein system,
the region-of-interest image, with its contrast enhanced using the exposure
fusion framework, is fed into the CNN [89]. The results of these systems are
combined to improve human identification. However, the interoperability of
the system and the management of quality of service (QoS) were the issues
observed.
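
The difference between early and late fusion can be illustrated with a
minimal sketch. The feature vectors, the match scores, and the fixed fusion
weight are hypothetical; in the learnable fusion described in [85], the
combination is trained with the network rather than fixed as it is here.

```python
import numpy as np

def early_fusion(features_a, features_b):
    """Early fusion: concatenate per-source features before classification."""
    return np.concatenate([features_a, features_b], axis=-1)

def late_fusion(scores_a, scores_b, weight_a=0.5):
    """Late fusion: combine per-source match scores by a weighted average."""
    return weight_a * scores_a + (1.0 - weight_a) * scores_b

# Hypothetical feature vectors from two biometric sources for one person.
features_a = np.array([0.2, 0.7, 0.1])
features_b = np.array([0.9, 0.4])
print(early_fusion(features_a, features_b).shape)  # (5,)

# Hypothetical per-identity match scores from two separate classifiers.
scores_a = np.array([0.9, 0.2, 0.4])
scores_b = np.array([0.7, 0.6, 0.1])
fused = late_fusion(scores_a, scores_b)
print(int(np.argmax(fused)))  # 0: the identity with the highest fused score
```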


2.4.7 E-Commerce and E-Business 


In the context of business-to-consumer (B2C) retail and e-commerce, Sohaib
et al. [90] propose an integrated framework of IoT and cloud computing for
people with sensory, motor (restricted use of hands), and cognitive (learning
and language disorders) impairments. IoT helps people to work smartly and
gain control over activities; it is popular in home automation, and it is
essential to business because it provides real-time monitoring of how
businesses work. It also provides supply chain and logistics operations with
information on machine performance. The work achieved a milestone by ensuring
ease of use and improving security in an IoT environment. The framework uses
state-of-the-art technology to support e-business activities, and the study
identified security challenges as an issue.


2.5 Challenges of Applicability of IoT-Enabled CNN in
Various Fields


Embedded systems are used in a variety of settings to support a wide range
of applications in the areas of security, traffic control, e-health, smart
homes, and smart cities. Digital systems regulate physical things in various
applications, resulting in a constant interplay between the digital and
physical worlds [91]. Meanwhile, the fundamental characteristics of these
applications, particularly when embedded systems are taken into account,
pose difficulties. Also, CNNs are similar to black boxes in that they are
difficult to analyze and explain [92, 93]. Analyzing the massive amount of
data in the cloud environment takes a long time, and this is detrimental to
the IoT applications and to the overall network performance of the system
[94]. The authors in [95] trained a CNN model with a noisy picture dataset,
which may result in an increase in misclassification errors. The addition of
a small quantity of random noise to the input image can fool the network,
causing the original image and slightly perturbed versions of it to be
classified differently [96].

Meanwhile, although IoT in conjunction with other emerging technologies such
as 5G networks, AI, and cloud computing has the potential to revolutionize
healthcare, commerce, agriculture, education, security, and other sectors,
the high cost of mass-scale deployment has made it difficult for some
e-health operators and governments to adopt it [97]. Typically, CNNs have
many benefits that outweigh their drawbacks in many operations, which is why
they are widely embraced in data science projects. The challenges of CNN
techniques in this digitalization include computational cost, data rate,
coverage, energy usage, and practicality, all of which are factors to be
considered along with IoT [98]. CNN is applicable in the creation of an
IoT-based, DL-based intrusion detection system that is both effective and
efficient; such a system was designed and implemented on the principles of a
TCNN, which combines CNN with causal convolution. CNN techniques are also
among the more prevalent strategies used in image recognition. They are
notable in the field of data science for their object detection and computer
vision frameworks. CNN employs DL algorithms, which take video or image
input, assign weights and biases to various parts of the image, and then
distinguish them from one another. They are based on discrete convolution and
try to use the spatial information among the pixels of an image [99].
Categorical appraisals of IoT research that used CNN techniques for its
implementation are scarce, especially ones that identify the prominent areas
of application in which IoT-enabled and CNN techniques were used. To improve
this area of research, the challenging areas of IoT-enabled CNN applications
are also revealed for possible improvement. Table 2.2 presents IoT-enabled
CNN studies, techniques, fields of application, and the limitations of the
studies, in order to ensure the prospects of the applications.

Table 2.2 IoT-enabled CNN studies in different fields: techniques and challenges

IoT-enabled CNN studies | Techniques | Fields of applications | Challenges | References
The IoTNet, a new model for resource-constrained situations which provides state-of-the-art performance in the realm of tiny efficient models, was proposed to manage computational complexity in image classification. | IoTNet | Technology | Augmentation strategies of CNN were not utilized | [100]
A mobile-based strategy called DCNN for counting individuals is proposed. The proposed mobile-friendly model is capable of detecting information quickly; people are counted everywhere and in various places. | VGG-16 (a pre-trained CNN) | Security and economic values | Non-uniform distribution of clutter, computational cost, practicality with IoT | [101]
A CNN model for picture learning and intelligence that is context-aware with an active CCTV robot IoT platform was developed to detect anomalous conditions in various industrial settings, factories, logistic warehouses, smart farms, and public areas. | IoT-Net | Security | The model developed exhibited little operation time due to battery inefficiency | [21]
A CNN sleep apnea detection IoT-enabled method that is based on ECG and EDR signals was developed. The model could detect sleep apnea on a minute-by-minute basis or a window of 30 seconds. | IoT-sensors | Healthcare | Smaller models suitable for sleep apnea detection in a few seconds for the patient with minimal tuning were not used | [90]
A unique IoT-enabled depth-wise separable CNN with SVM was proposed for COVID-19 diagnosis and classification. The algorithm aims to identify all class labels of COVID-19 using CXR images. | DWS-CNN-IoT | Healthcare | The data rate and energy consumption | [91]
A system for detecting data packet breaches in network security is introduced. The system follows a CNN-based approach for intrusion detection. | CNN-NS | Security | Computational cost, coverage, energy consumption, and security | [92]
The cost-effective ways of improving healthcare living facilities by integrating remote health monitoring and Internet of medical things were proposed, and the system was used to monitor diabetic patients’ vital signals in real time successfully. | IoMT-sensors | Healthcare | Use of multiple protocols and devices, resulting in high energy consumption, and data intrusion | [93]
Digital technology plays a vital role in supporting social, professional, and economic activities when people are confined to their homes. The Internet of Things has a track record of providing high-quality remote healthcare and automation services, allowing for social separation while sustaining population health and well-being. | IoT-sensor | Health | The lack of suitable safeguards and rules in handling personal medical data, and privacy, increases the vulnerabilities associated | [102]
IoMT-based systems must be able to not only maintain acceptable performance in the face of such changes but also react appropriately if necessary. These IoMT systems also create security concerns since they frequently regulate conditions in which a system failure might be fatal. | IoT-sensor | Health | Security | [103]


There are numerous prospects in the combination of these prominent and
promising computing fields, CNN and IoT, but a number of modern challenges
are associated with the techniques. The deployment of technologies such as
drones, robots, autonomous vehicles, and other IoT tools has revolutionized
data capturing and transmission, and data analysis for informed decision
making has become easier with CNN techniques. However, these achievements are
threatened by computational cost, coverage, energy consumption, and the
difficulty of securing the architecture, devices, and data. The lack of
suitable safeguards and rules for handling personal medical data, as well as
of algorithms for assuring privacy and security, increases the
vulnerabilities associated with the CNN-IoT system.


2.6 Conclusion and Future Direction 


The emergence of new generations of networks is assisting the cloud computing
environment and, indeed, IoT-based solutions. IoT connects computing devices
with mechanical, electronic, and other objects, such as animals or humans,
with the help of unique identifiers, and gives them the ability to transmit
data through a network without the intervention of a computer or human
being. The chapter discusses the application of AI- and IoT-enabled
technologies in various fields of human endeavor, highlighting the novelty of
these state-of-the-art technologies. We identified prominent research in the
areas of health, technology, and security where IoT-enabled and CNN
technologies were applied, and discussed their contributions to knowledge.
The CNN and its architectures were reviewed, the research was summarized
according to the field of application, and the areas of limitation were
identified, with particular attention to healthcare, technology, humanity,
security, infrastructure development, and agriculture with respect to
techniques. Our contribution is three-fold. First, we identified the
IoT-enabled CNN applications in popular fields. Second, we discussed some
insights for selecting appropriate techniques for end-users of CNN tools in
the IoT environment. Third, we identified challenging areas of CNN-IoT for
improvement. The three operational layout steps for the effective use of a
CNN architecture were discussed: (1) data collection and feature generation,
(2) data preprocessing and feature selection, and (3) data partitioning,
learning, and analysis.

This chapter recommends more collaboration between researchers and the users
and experts in the areas where such IoT applications are used, in order to
overcome limitations, enhance IoT technologies, and provide insight into the
security of future work in cyberspace. Given the numerous challenges
identified in various fields, such as computational cost, data rate,
coverage, energy consumption, security, and the applicability of CNN with
IoT, research should be directed toward implementing CNN-enabled IoT that
provides solutions to meet people’s needs in the areas of healthcare,
education, security, infrastructure development, agriculture, and so on, so
as to address the factors that prevent people from harnessing the benefits
of AI-IoT. Specifically, we suggest the following directions for future work
based on our review. Since CNNs are primarily used for image processing,
implementing state-of-the-art CNN architectures on sequential data
necessitates the translation of 1D data into 2D data. The interoperability
of the system and the management of quality of service (QoS) were the issues
observed in some studies, and mobility and scalability problems were reported
in real-time applications of IoT-enabled technologies. The use of DCNNs for
sequential data is becoming more popular because of their good feature
extraction ability, and efficient computation with a small number of
parameters for effective data transmission in a cloud environment should be
looked into.
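
One simple way of translating 1D data into 2D data, as mentioned above, is to
cut the sequence into windows and stack them as rows. This is only one of
several possible mappings and is offered as an assumption-laden sketch; the
function name, window width, and hop size are illustrative.

```python
import numpy as np

def signal_to_image(signal, width, hop=None):
    """Stack fixed-width windows of a 1D signal as the rows of a 2D array.

    hop is the step between window starts; it defaults to width, which
    gives non-overlapping windows.
    """
    hop = hop or width
    n_rows = (len(signal) - width) // hop + 1
    return np.stack([signal[r * hop:r * hop + width] for r in range(n_rows)])

signal = np.arange(12, dtype=float)
print(signal_to_image(signal, width=4).shape)         # (3, 4)
print(signal_to_image(signal, width=4, hop=2).shape)  # (5, 4), overlapping rows
```

The resulting array can be passed to a 2D CNN in the same way as a
single-channel image.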


Abstract


Recently, speech-controlled smart devices have come to play an important
role in Internet of Things (IoT) applications. Both reverberation and noise
may significantly reduce the efficiency of human-machine interaction in
indoor applications. Therefore, speech enhancement has become a critical
front-end technique for improving performance, and it has attracted
increasing attention in recent years. This chapter focuses on deep learning
(DL) based monaural speech enhancement algorithms for both denoising and
dereverberation, and considers the extraction of both single and multiple
speakers. More specifically, convolutional neural network (CNN) based models
are presented for this challenging speech enhancement task because of their
parameter efficiency and state-of-the-art performance. After describing
one-stage and multi-stage CNN-based models, numerous experiments are
conducted to show their advantages and disadvantages when they are applied to
extract one desired speaker and multiple desired speakers. This study reveals
that CNN-based models can achieve high performance when there is only one
desired speaker to be extracted, while their performance may degrade
considerably for multiple desired speakers. Some potential strategies for
improving the performance of extracting multiple desired speakers are
discussed, and future research directions are outlined.


Keywords: Internet of Things, speech dereverberation, deep learning, multi- 
stage, nonlinear compression, multiple speakers. 


3.1 Introduction 


The Internet of Things (IoT), which aims at connecting a vast number of
devices and sharing their data over the Internet, has become an important
part of our daily life. Speech-controlled devices are common in smart-home
IoT applications. In indoor applications, clean speech is frequently
corrupted by both environmental noise and the late reverberation of the room,
which may degrade speech quality and intelligibility and increase the word
error rate, leading to a poorer user experience. Therefore, speech
enhancement, including noise reduction and dereverberation, has become a
critical and active front-end technique for improving the convenience of
speech-controlled IoT devices. We focus on monaural speech enhancement
algorithms in this chapter, as the majority of IoT devices have only one
microphone for speech acquisition.

Over the last half century, a vast number of noise reduction algorithms
have been proposed, which can be roughly divided into two kinds: conventional
statistical signal processing (SSP) based algorithms and neural network (NN)
based algorithms. Typical SSP-based algorithms include spectral subtraction
[1], Wiener filtering [2], the minimum mean square error (MMSE) based
short-time spectral amplitude estimator and the MMSE-based log-spectral
amplitude estimator [3, 4], subspace-based approaches [5], and so on. These
conventional SSP-based algorithms can effectively suppress stationary and/or
quasi-stationary noise, but their performance may degrade considerably when
handling highly non-stationary noise [6]. In contrast, noise reduction
algorithms based on neural networks, such as shallow neural networks (SNNs)
and deep neural networks (DNNs) or deep learning (DL) networks, can often
achieve much better performance in more challenging scenarios, such as
non-stationary noise environments and low signal-to-interference ratio (SIR)
cases. In the early phase, NN-based supervised speech enhancement algorithms
were inspired by the theory of computational auditory scene analysis (CASA)
[7, 8]. These methods can be called time-frequency (T-F) mask mapping based
methods, which aim to estimate the real ideal ratio mask (rIRM), the real
ideal binary mask (rIBM), or the complex ideal ratio mask (cIRM). The
estimated T-F mask is then multiplied with the noisy complex spectrum for
noise reduction [9, 10]. Later, more efficient methods that map the noisy
spectrum directly to the clean spectrum were proposed, namely spectral
mapping methods [11, 12]. Time-domain mapping-based algorithms were proposed
in [13] and have been shown to be more effective in speech separation.

Environmental noise usually refers to additive noise, while early
reverberation is a typical kind of convolutional interference. Late
reverberation is often assumed to be uncorrelated with, or independent of,
the early reflections, and can thus also be regarded as additive noise [14].
Accordingly, many SSP-based algorithms have been proposed for
dereverberation, such as spectral subtraction [1], weighted prediction error
(WPE) [15, 16], and inverse filtering [17, 18]. Recently, more studies have
performed the dereverberation task in a supervised way [19, 20, 21, 22]. As
with NN-based denoising methods, mask mapping and spectral mapping are the
two effective training schemes, where spectral mapping includes log-power
spectrum (LPS) mapping [23], target magnitude/complex spectrum (TMS) mapping
[19], and compressed magnitude/complex spectrum (CMS) mapping [22]. The
above-mentioned DL networks for noise reduction have been extended to speech
dereverberation, where they have shown better performance than traditional
SSP-based dereverberation algorithms.

In most cases, noise reduction and dereverberation are considered
separately. It is well known, however, that additive noise and reverberation
often co-exist in real-world acoustic environments and that both need to be
suppressed to reduce their influence on speech quality and speech
intelligibility. Handling noise and reverberation together has not been
sufficiently addressed, especially when multiple speakers need to be enhanced
simultaneously. Recently, several NN-based monaural speech enhancement
algorithms have been proposed that aim at joint removal of noise and
reverberation. For instance, a framework that jointly implements denoising
and dereverberation was proposed in [26], where the multi-target optimization
is decomposed into four stages, and the denoising module and the
dereverberation module are implemented sequentially. Moreover, a phase-aware
mask and a complex ratio mask were proposed in [27] and [28], respectively,
tackling the denoising task and the dereverberation task with a single-stage
framework. An end-to-end WaveNet was proposed in [29] to suppress both noise
and reverberation.

Typical DL networks include fully connected neural networks (FCNs) [30],
recurrent neural networks (RNNs), e.g., long short-term memory (LSTM) layers
[31, 32, 33], and convolutional neural networks (CNNs) [8, 34, 35]. Notably,
both CNNs and RNNs have shown excellent capability in improving the
perceptual evaluation of speech quality (PESQ) score [36], the extended
short-time objective intelligibility (ESTOI) score [37], and the
signal-to-distortion ratio (SDR) [38]. CNNs require far fewer trainable
parameters, while RNNs offer outstanding generalization ability [39], and
these two kinds of models have thus become the most widely used in the last
decade. More recently, integrations of CNNs and RNNs have emerged, such as
convolutional recurrent networks (CRNs) [11] and gated convolutional
recurrent networks (GCRNs) [12]. These models benefit from the advantages of
both the CNN and RNN models.

It should be emphasized that most of these state-of-the-art methods focus
on extracting only one desired speaker. In many practical speech
communication applications, multiple speakers may speak at the same time, and
all of them need to be enhanced and then transmitted to the far end; this
problem has not been well studied. In addition, many of the aforementioned
NN-based methods have shown that phase recovery is also important for
improving perceptual speech quality in low-SNR scenarios and highly
reverberant cases [22, 27, 28]. Accordingly, a two-stage complex network
(TSCN) has been proposed in the literature [40]. The TSCN is made up entirely
of CNN layers and decouples the optimization of magnitude and phase to
achieve better performance. Note that this previous work mainly focuses on
noise reduction applications. In this chapter, we compare one-stage and
multi-stage CNN-based models for tackling the simultaneous denoising and
dereverberation problem. The advantages and disadvantages of these two
CNN-based models are revealed by applying them to both one-desired-speaker
and multiple-desired-speaker scenarios.

This chapter is organized as follows. Section 3.2 describes the signal
model and formulates the problem. Sections 3.3 and 3.4 give the details of
the one-stage and multi-stage CNN models, respectively. Section 3.5 presents
the experimental setup, followed by the experimental results and some
analysis in Section 3.6. Finally, we discuss the findings and draw some
conclusions at the end of the chapter.


3.2 Signal Model and Problem Formulation
3.2.1 Signal Model


For indoor applications, the time-domain noisy-reverberant speech received
at a microphone can be given by

$$
y(t) = \sum_{c=1}^{C} s^{(c)}(t) * h^{(c)}(t) + n(t), \tag{3.1}
$$

where $s^{(c)}(t)$, $h^{(c)}(t)$, and $n(t)$ denote the $c$th clean speech
source, its corresponding room impulse response (RIR), and the environmental
noise at the time index $t$, respectively. $C \geq 1$ is the number of clean
speech sources, and $*$ stands for the convolution operator. For many
practical applications, it is common to divide $h^{(c)}(t)$ into two parts,
namely, the direct impulse response $h_d^{(c)}(t)$ and the reverberation
impulse response $h_r^{(c)}(t)$. Therefore, the noisy and reverberant speech
signal can be expressed as

$$
\begin{aligned}
y(t) &= \sum_{c=1}^{C} \left[ s^{(c)}(t) * h_d^{(c)}(t) + s^{(c)}(t) * h_r^{(c)}(t) \right] + n(t) \\
     &= \sum_{c=1}^{C} x^{(c)}(t) + \sum_{c=1}^{C} r^{(c)}(t) + n(t) \\
     &= x(t) + r(t) + n(t),
\end{aligned} \tag{3.2}
$$

where $x(t)$ and $r(t)$ represent the desired direct speech components and
the reverberant speech components of the $C$ desired speakers, respectively.
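The decomposition in eqns (3.1) and (3.2) can be sketched numerically as
follows. This is only an illustrative sketch: the function name
`simulate_mixture` and the parameter `direct_len` (how many RIR taps after
the main peak are counted as the direct part) are assumptions introduced
here, since the chapter does not state where the RIR is split.

```python
import numpy as np

def simulate_mixture(sources, rirs, noise, direct_len=64):
    """Build y(t) = x(t) + r(t) + n(t) as in eqns (3.1)-(3.2).

    sources, rirs: lists of 1-D arrays, one pair per speaker.
    noise: 1-D array; every source is assumed to be at least this long.
    direct_len: hypothetical split point (in taps after the RIR peak)
        between the direct part h_d and the reverberant part h_r.
    """
    noise = np.asarray(noise, dtype=float)
    n_samples = len(noise)
    x = np.zeros(n_samples)  # direct speech components
    r = np.zeros(n_samples)  # reverberant speech components
    for s, h in zip(sources, rirs):
        s = np.asarray(s, dtype=float)
        h = np.asarray(h, dtype=float)
        split = int(np.argmax(np.abs(h))) + direct_len
        # h_d keeps the taps up to the split point, h_r keeps the rest,
        # so that h = h_d + h_r exactly.
        h_d = np.concatenate([h[:split], np.zeros(max(len(h) - split, 0))])
        h_r = h - h_d
        x += np.convolve(s, h_d)[:n_samples]
        r += np.convolve(s, h_r)[:n_samples]
    y = x + r + noise
    return y, x, r
```

Because convolution is linear, the sum `x + r` equals the sources convolved
with the full RIRs, so the returned `y` matches eqn (3.1) regardless of the
chosen split point.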


3.2.2 Feature Extraction 


Instead of recovering speech in the time domain directly, various studies
have shown that it is more advantageous to implement speech enhancement in
the T-F domain [5, 12, 30]. With the short-time Fourier transform (STFT), the
spectral patterns of the noise components and the speech components can be
effectively decomposed and thus distinguished more easily, which is
beneficial for model training. Therefore, in this chapter, we handle the
reverberation and the noise in the T-F domain. Specifically, taking the STFT
on both sides of eqn (3.2), the time-frequency (T-F) signal model can be
given by

$$
\begin{aligned}
Y(k,l) &= \sum_{c=1}^{C} \left[ S^{(c)}(k,l) H_d^{(c)}(k,l) + S^{(c)}(k,l) H_r^{(c)}(k,l) \right] + N(k,l) \\
       &= \sum_{c=1}^{C} X^{(c)}(k,l) + \sum_{c=1}^{C} R^{(c)}(k,l) + N(k,l) \\
       &= X(k,l) + R(k,l) + N(k,l),
\end{aligned} \tag{3.3}
$$

where $Y(k,l)$ is the complex spectrum of $y(t)$, and the remaining variables
in eqn (3.3) have similar definitions. $k$ and $l$ denote the frequency index
and the time index, respectively. The index pair $(k,l)$ will be omitted in
the following when no confusion arises.

In [22], compressed complex spectra are used instead of the original
complex spectra as both the input and the output of the CNN model. Note that
only the magnitudes of the complex spectra are compressed, while the phase
information is unchanged. This is because phase spectra have almost no
regular structure, and it is often difficult to recover them accurately. With
this compression scheme, the complex spectrum with magnitude compression can
be expressed as $Y^{o} = |Y|^{\gamma} e^{j\theta_Y}$, which can also be
written as

$$
\begin{aligned}
Y^{o} &= Y_r^{o} + jY_i^{o} \\
      &= |Y|^{\gamma} \cos\theta_Y + j|Y|^{\gamma} \sin\theta_Y,
\end{aligned} \tag{3.4}
$$

where $j = \sqrt{-1}$, and $(\cdot)_r$ and $(\cdot)_i$ indicate extraction of
the real part and the imaginary part of a complex spectrum, respectively.
$\gamma$ denotes the power compression parameter and $\theta_Y$ is the phase
of $Y$. Pairs of $\{Y_r^{o}, Y_i^{o}\}$ are chosen as the input features
throughout this chapter.


3.2.3 Problem Formulation 


For denoising only, we suppress the noise without handling the reverberant
speech components, and the target of this task is to extract $x(t) + r(t)$ or
$X + R$ from $y(t)$ or $Y$. Conversely, for dereverberation only, we suppress
the reverberant speech components without suppressing the noise, and the
target is to extract $x(t) + n(t)$ or $X + N$ from $y(t)$ or $Y$. When both
denoising and dereverberation are necessary, the target is to extract $x(t)$
or $X$ from $y(t)$ or $Y$. For many applications, such as automatic speech
recognition systems, hands-free speech communication devices, and
hearing-assistive devices, it is highly desirable to suppress both the noise
and the late reverberant speech components.

In this study, we take the approach of compressed complex spectrum mapping
for simultaneous denoising and dereverberation, and a neural network composed
entirely of CNN layers is utilized to recover the complex spectrum of the
direct speech signal. The input features of the network are the compressed
complex spectrum of the noisy-reverberant speech, i.e.,
$\{Y_r^{o}, Y_i^{o}\}$, and the target for network training is the compressed
complex spectrum of the direct speech, i.e., $\{X_r^{o}, X_i^{o}\}$. The
mapping process can be written as

$$
\{\hat{X}_r^{o}, \hat{X}_i^{o}\} = \mathcal{G}(Y_r^{o}, Y_i^{o}; \Phi), \tag{3.5}
$$

where $\mathcal{G}(\cdot,\cdot;\Phi)$ denotes the mapping function of the
network with the parameter set $\Phi$, and the symbol $\hat{(\cdot)}$ denotes
the estimated value.

Finally, we can recover the uncompressed complex spectrum from the
estimated compressed complex spectrum, which can be given by

$$
\begin{aligned}
\hat{X} &= \hat{X}_r + j\hat{X}_i \\
        &= |\hat{X}^{o}|^{1/\gamma} \cos\theta_{\hat{X}} + j|\hat{X}^{o}|^{1/\gamma} \sin\theta_{\hat{X}},
\end{aligned} \tag{3.6}
$$

where $\theta_{\hat{X}}$ is the phase of the estimated spectrum, which the
compression leaves unchanged.

Because complex spectrum compression can be regarded as the pre-processing
and post-processing of NN-based denoising and dereverberation algorithms, we
omit the superscript $o$ and the compression parameter $\gamma$ in the
following to simplify the notation. As in [41], we set $\gamma = 0.5$.
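The pre-processing of eqn (3.4) and the post-processing of eqn (3.6) can be
sketched with NumPy as below. The function names `compress` and `decompress`
are illustrative assumptions rather than names used by the authors, and the
default value 0.5 is the compression parameter quoted in the text.

```python
import numpy as np

def compress(spec, gamma=0.5):
    """Power-compress the magnitude of a complex spectrum, keep the phase.

    Returns the real and imaginary parts used as network features.
    """
    mag_c = np.abs(spec) ** gamma
    phase = np.angle(spec)
    return mag_c * np.cos(phase), mag_c * np.sin(phase)

def decompress(real_c, imag_c, gamma=0.5):
    """Invert compress(): raise the magnitude to 1/gamma, keep the phase."""
    mag_c = np.hypot(real_c, imag_c)
    phase = np.arctan2(imag_c, real_c)
    return mag_c ** (1.0 / gamma) * np.exp(1j * phase)
```

Applying `decompress` to the output of `compress` with the same `gamma`
returns the original spectrum up to floating-point error, which is why the
compression can be treated as a pure pre-/post-processing step.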


3.3 One-stage CNN-based Speech Enhancement 


As mentioned above, CRNs have become popular because they capture both the
short-term and long-term dependencies of speech and are parameter-efficient.
To further improve parameter efficiency, squeezed temporal convolutional
modules (S-TCMs) were introduced to replace the LSTM modules in GCRN [40]. We
name the resulting network the gated convolutional TCM Net (GCT-Net). GCT-Net
consists entirely of CNNs and is adopted as the basic network for both
one-stage and multi-stage speech enhancement in this study.


3.3.1 The Architecture of GCT-Net 


GCT-Net, as shown in Figure 3.1(a), is trained to map the complex spectrum
of the noisy-reverberant speech to that of the corresponding direct speech.
It consists of three modules, namely the encoder, the stacked S-TCMs, and the
decoders. Specifically, the encoder comprises six convolutional gated linear
unit (Conv-GLU) layers for complex spectral feature extraction, as shown in
Figure 3.1(b). The two decoders mirror the structure of the encoder, each
containing six deconvolutional gated linear unit (Deconv-GLU) layers, as
shown in Figure 3.1(c). These two decoders output the real and imaginary
parts of the complex spectrum of the direct speech, respectively. Between the
encoder and decoder modules, three groups of stacked S-TCMs are adopted to
capture the sequential information, each of which contains six S-TCM units
with exponentially increasing dilation rates so that the model has a large
temporal receptive field. Moreover, we introduce skip connections to
concatenate the output of each encoder layer with the input of its
corresponding decoder layer. In GCT-Net, all convolutional and
deconvolutional layers are implemented causally. In other words, no future
information is used to infer the current output.

Figure 3.1 (a) Diagram of the GCT-Net. (b) The encoder of GCT-Net having six
Conv-GLU blocks. (c) The decoder of GCT-Net having six Deconv-GLU blocks.


With one-stage speech enhancement, the mapping process can be formulated
as

$$
\{\hat{X}_r, \hat{X}_i\} = \mathcal{G}_{GCT}(Y_r, Y_i; \Phi_{GCT}), \tag{3.7}
$$

where $\mathcal{G}_{GCT}(\cdot,\cdot;\Phi_{GCT})$ is the mapping function of
GCT-Net, and $\Phi_{GCT}$ is the corresponding parameter set.

Figure 3.2 (a) The structure of Conv-GLU. (b) The structure of Deconv-GLU.
$\sigma(\cdot)$ denotes a sigmoid function.


3.3.2 Gated Linear Units 


Gating structures control the data flow in a network and thus allow more
complex relationships to be modeled. For example, by endowing different gates
with specific functions, RNNs can build long-term memory when modeling
sequential data. Accordingly, the gated linear units (GLUs) were proposed in
[42], which can be given by

$$
\mathbf{v} = (\mathbf{W}_1 * \mathbf{u} + \mathbf{b}_1) \otimes \sigma(\mathbf{W}_2 * \mathbf{u} + \mathbf{b}_2), \tag{3.8}
$$

where $\mathbf{u}$ and $\mathbf{v}$ are the input and output of the gated
linear unit, $\mathbf{W}_1$ and $\mathbf{W}_2$ are the convolution weights,
$\mathbf{b}_1$ and $\mathbf{b}_2$ are their corresponding biases, and
$\otimes$ denotes element-wise multiplication. $\sigma(\cdot)$ denotes a
sigmoid function, which maps its input to the range from 0 to 1. In this
study, GLUs are integrated into both the convolutional layers and the
deconvolutional layers in the encoder and the two decoders. The Conv-GLU
block is illustrated in Figure 3.2(a), where parametric ReLU (PReLU) and
normalization (Norm) layers are applied after the multiplication of eqn
(3.8). The Deconv-GLU block has a structure similar to the Conv-GLU, as
plotted in Figure 3.2(b).
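The gating of eqn (3.8) can be illustrated with a minimal NumPy sketch. As a
simplifying assumption, the two convolutions are replaced here by pointwise
(1 × 1) channel mixing, whereas the chapter uses 2 × 5 and 2 × 3 kernels; the
PReLU and normalization layers are also left out. The function name
`glu_pointwise` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping real inputs to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def glu_pointwise(u, w1, b1, w2, b2):
    """Gated linear unit with 1 x 1 kernels (a simplification of eqn (3.8)).

    u:  input of shape (C_in, T, F)
    w1, w2: weights of shape (C_out, C_in) for the linear and gate branches
    b1, b2: biases of shape (C_out,)
    """
    linear = np.einsum("oc,ctf->otf", w1, u) + b1[:, None, None]
    gate = sigmoid(np.einsum("oc,ctf->otf", w2, u) + b2[:, None, None])
    # The gate scales each element of the linear branch by a factor in (0, 1).
    return linear * gate
```

With all gate weights and biases set to zero, the gate equals 0.5 everywhere
and the unit reduces to half of the linear branch, which is a convenient
sanity check.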


3.3.3 S-TCMs 


TCMs have already been widely used in speech separation [13, 43] and
speech enhancement [40, 41]. They can even perform better than LSTM in
temporal sequence modeling, and thanks to the parallel convolution mechanism,
they can reduce inference time remarkably. Figure 3.3(a) presents the
architecture of the TCM, which is the same as in [41].

Figure 3.3 Diagram of an S-TCM. $\sigma(\cdot)$ denotes a sigmoid function.

Although stacked TCMs have obtained satisfactory performance, their
parameter burden is still considerable, which makes them unsuitable for some
portable devices. Thus, we adopt a squeezed TCM (S-TCM) [41], which replaces
the depth-wise convolution with a regular convolution. The diagram of this
S-TCM unit is shown in Figure 3.3(b). Different from the TCM, the S-TCM unit
first squeezes the input channels to a lower dimension with a 1 × 1-Conv, so
that the regular dilated convolution operates on fewer channels and the
number of parameters decreases. In addition, inspired by GLUs, we also
introduce another gated branch. Its structure is similar to that of the main
branch except that the sigmoid function is adopted, which is beneficial for
the back propagation of gradients in the network.


3.3.4 Framework Details 


The detailed parameters of GCT-Net are presented in Table 3.1. The size of
input features is given as (TimeStep × FeatureSize) and (ChannelNum ×
TimeStep × FeatureSize). The hyper-parameters are given as (KernelSize,
Stride, ChannelNum) for the encoder and the decoder, and (KernelSize,
DilationRate, ChannelNum) for S-TCMs. δ refers to the number of input
channels.

Table 3.1 Detailed parameter setup for GCT-Net and ME-Net, where δ denotes
the number of input channels, and T denotes the TimeStep

Layer name          Input size       Hyper-parameters      Output size
Conv-GLU_1          δ × T × 257      2 × 5, (1, 2), 64     64 × T × 127
Conv-GLU_2          64 × T × 127     2 × 3, (1, 2), 64     64 × T × 63
Conv-GLU_3          64 × T × 63      2 × 3, (1, 2), 64     64 × T × 31
Conv-GLU_4          64 × T × 31      2 × 3, (1, 2), 64     64 × T × 15
Conv-GLU_5          64 × T × 15      2 × 3, (1, 2), 64     64 × T × 7
Conv-GLU_6          64 × T × 7       2 × 3, (1, 2), 64     64 × T × 3
Reshape_Size_1      64 × T × 3       -                     T × 192
S-TCMs              T × 192          3 × {1, 1, 64;        T × 192
                                     5, d, 64; 1, 1, 192}
                                     for d = 1, 2, 4, 8,
                                     16, 32
Reshape_Size_2      T × 192          -                     64 × T × 3
Skip_Connect_1      64 × T × 3       -                     128 × T × 3
Deconv-GLU_1        128 × T × 3      2 × 3, (1, 2), 64     64 × T × 7
Skip_Connect_2      64 × T × 7       -                     128 × T × 7
Deconv-GLU_2        128 × T × 7      2 × 3, (1, 2), 64     64 × T × 15
Skip_Connect_3      64 × T × 15      -                     128 × T × 15
Deconv-GLU_3        128 × T × 15     2 × 3, (1, 2), 64     64 × T × 31
Skip_Connect_4      64 × T × 31      -                     128 × T × 31
Deconv-GLU_4        128 × T × 31     2 × 3, (1, 2), 64     64 × T × 63
Skip_Connect_5      64 × T × 63      -                     128 × T × 63
Deconv-GLU_5        128 × T × 63     2 × 3, (1, 2), 64     64 × T × 127
Skip_Connect_6      64 × T × 127     -                     128 × T × 127
Deconv-GLU_6        128 × T × 127    2 × 5, (1, 2), 64     64 × T × 257
Reshape_Size_3      1 × T × 257      -                     T × 257
Linear_1(Softplus)  T × 257          257                   T × 257

More specifically, within each Conv-GLU block, the kernel size along the
time axis is set to 2, and that along the frequency axis is set to 5 in the
first block and 3 in the following blocks. The stride is set to (1, 2), so
that the frequency size is roughly halved at each block while the time size
remains unchanged. The number of channels in each convolutional layer is 64.
In addition, instance normalization and PReLU layers are adopted after each
convolutional layer. The decoder has an architecture similar to the encoder,
with all Conv-GLUs replaced by Deconv-GLUs. Moreover, a skip connection is
introduced between each Conv-GLU and Deconv-GLU pair.

In the bottleneck of GCT-Net, three groups of stacked S-TCMs are adopted
to gradually capture long-range temporal information, and each group has six
S-TCM units with exponentially increasing dilation rates, i.e.,
(1, 2, 4, 8, 16, 32). Note that the module consisting of stacked S-TCMs is
causal because no future frames are needed to infer the current frame output.
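The feature sizes listed in Table 3.1 and the temporal context of the S-TCM
stack can be checked with a few lines of arithmetic. The sketch below assumes
no padding along the frequency axis and the usual output-size formulas for
strided convolution and transposed convolution; these assumptions are not
stated in the chapter but reproduce the sizes in the table. The helper names
are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded strided convolution."""
    return (size - kernel) // stride + 1

def deconv_out(size, kernel, stride):
    """Output length of an unpadded transposed convolution."""
    return (size - 1) * stride + kernel

def encoder_freq_sizes(n_bins=257, kernels=(5, 3, 3, 3, 3, 3), stride=2):
    """Frequency size after each of the six Conv-GLU blocks."""
    sizes = [n_bins]
    for k in kernels:
        sizes.append(conv_out(sizes[-1], k, stride))
    return sizes

def decoder_freq_sizes(start=3, kernels=(3, 3, 3, 3, 3, 5), stride=2):
    """Frequency size after each of the six Deconv-GLU blocks."""
    sizes = [start]
    for k in kernels:
        sizes.append(deconv_out(sizes[-1], k, stride))
    return sizes

def stcm_receptive_field(kernel=5, dilations=(1, 2, 4, 8, 16, 32), groups=3):
    """Frames seen by the stacked dilated convolutions (1 x 1 convs add none)."""
    return 1 + groups * sum((kernel - 1) * d for d in dilations)

print(encoder_freq_sizes())    # [257, 127, 63, 31, 15, 7, 3]
print(decoder_freq_sizes())    # [3, 7, 15, 31, 63, 127, 257]
print(stcm_receptive_field())  # 757
```

Under these assumptions the bottleneck sees 757 consecutive frames, which
with the 8 ms frame shift used later in the chapter corresponds to roughly
6 s of past context.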


3.3.5 Loss Function 


The mean square error (MSE) is one of the most widely used loss functions
for training neural network models, even in the application of complex
spectrum mapping [12]. Recently, an integration of the magnitude constraint
and the complex spectrum constraint has been shown to be essential for
spectral recovery [41, 44]. In this study, the following loss function is
considered:

$$
\mathcal{L}_{gct} = \alpha \mathcal{L}_{gct}^{RI} + (1 - \alpha) \mathcal{L}_{gct}^{Mag}, \tag{3.9}
$$

where $\mathcal{L}_{gct}^{RI}$ and $\mathcal{L}_{gct}^{Mag}$ denote the
constraint of the complex spectrum and that of the magnitude, respectively,
which are given by

$$
\mathcal{L}_{gct}^{RI} = \left\| \hat{X}_r - X_r \right\|_F^2 + \left\| \hat{X}_i - X_i \right\|_F^2, \tag{3.10}
$$

$$
\mathcal{L}_{gct}^{Mag} = \left\| \sqrt{\hat{X}_r^2 + \hat{X}_i^2} - \sqrt{X_r^2 + X_i^2} \right\|_F^2, \tag{3.11}
$$

where $\alpha \in [0, 1]$ is a tuning parameter that makes a trade-off
between the RI loss and the spectral magnitude loss. We empirically set
$\alpha = 0.5$ in this study.


3.4 Multi-Stage CNN-based Speech Enhancement


Figure 3.4(a) plots the two-stage network based on (compressed) complex
spectrum mapping (CTS-Net), which showed superior performance in the second
deep noise suppression (DNS) challenge [40]. The details of CTS-Net can be
found in [41]. ME-Net aims to estimate the raw magnitude, denoted by
$|\hat{X}^{me}|$, and CS-Net aims to reduce the residual noise components and
recover the real and imaginary parts of the residual complex spectrum,
denoted by $\hat{X}^{cs}$. The two are finally added together to obtain the
estimate of the complex spectrum. The whole procedure of the two-stage
network can be formulated as

$$
|\hat{X}^{me}| = \mathcal{G}_1(|Y|; \Phi_1), \tag{3.12}
$$

$$
\hat{X}_r^{me} = \Re\left(|\hat{X}^{me}| e^{j\theta_Y}\right), \quad
\hat{X}_i^{me} = \Im\left(|\hat{X}^{me}| e^{j\theta_Y}\right), \tag{3.13}
$$

$$
\{\hat{X}_r^{cs}, \hat{X}_i^{cs}\} = \mathcal{G}_2(\hat{X}_r^{me}, \hat{X}_i^{me}, Y_r, Y_i; \Phi_2), \tag{3.14}
$$

$$
\hat{X}_r = \hat{X}_r^{me} + \hat{X}_r^{cs}, \quad
\hat{X}_i = \hat{X}_i^{me} + \hat{X}_i^{cs}, \tag{3.15}
$$

where $\theta_Y$ denotes the original noisy phase, and $\Re(\cdot)$ and
$\Im(\cdot)$ extract the real and imaginary parts, respectively.
$\mathcal{G}_1(\cdot; \Phi_1)$ and
$\mathcal{G}_2(\cdot,\cdot,\cdot,\cdot; \Phi_2)$ represent the nonlinear
mapping function of ME-Net and that of CS-Net, respectively.

Figure 3.4 Diagrams of CTS-Net, ME-Net, and CS-Net. (a) Diagram of CTS-Net.
(b) Diagram of ME-Net. (c) Diagram of CS-Net.
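The data flow of eqns (3.12) to (3.15) can be sketched as follows. Here
`me_net` and `cs_net` are hypothetical callables standing in for the trained
ME-Net and CS-Net; the argument order passed to `cs_net` is an assumption,
and no actual network is implemented.

```python
import numpy as np

def two_stage_enhance(noisy_spec, me_net, cs_net):
    """Two-stage estimate of the direct-speech complex spectrum.

    noisy_spec: complex array Y of shape (T, F)
    me_net: callable mapping |Y| to a coarse magnitude estimate
    cs_net: callable mapping (coarse_r, coarse_i, Y_r, Y_i) to the
        real and imaginary parts of the residual spectrum
    """
    # Stage 1: coarse magnitude, eqn (3.12).
    mag_coarse = me_net(np.abs(noisy_spec))
    # Couple the coarse magnitude with the noisy phase, eqn (3.13).
    coarse = mag_coarse * np.exp(1j * np.angle(noisy_spec))
    # Stage 2: residual real/imaginary parts, eqn (3.14).
    res_r, res_i = cs_net(coarse.real, coarse.imag,
                          noisy_spec.real, noisy_spec.imag)
    # Global residual connection, eqn (3.15).
    return (coarse.real + res_r) + 1j * (coarse.imag + res_i)
```

Because the first stage reuses the noisy phase, any phase correction has to
come from the residual predicted in the second stage, which is the point of
decoupling magnitude and phase optimization.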


3.4.1 Framework Structure 


The detailed structures of ME-Net and CS-Net are presented in Figures
3.4(b) and 3.4(c), respectively. One can see that ME-Net has a topology
similar to that of GCT-Net, except that ME-Net has only one decoder for
magnitude estimation. The topology of CS-Net is the same as that of GCT-Net,
but it aims to estimate the residual RI components rather than the desired RI
components. Therefore, the hyper-parameter setups for ME-Net and CS-Net are
nearly the same as that of GCT-Net, which is presented in Table 3.1. Note
that δ = 1 for ME-Net because there is only one input channel. Moreover,
because the spectral magnitude satisfies $|X^{me}| \in [0, \infty)$, Softplus
is chosen as the activation function of the last layer. Different from
ME-Net, linear outputs are adopted in CS-Net because both the real and
imaginary parts of the complex spectrum have unbounded ranges. In addition,
as the inputs of CS-Net contain both the original and the coarsely estimated
complex spectra, the number of input channels is set to δ = 4.


3.4.2 Loss Function 


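We apply a two-stage training strategy to train CTS-Net. First, ME-Net is
trained with the MSE loss, which is

$$
\mathcal{L}_{me} = \left\| |\hat{X}^{me}| - |X| \right\|_F^2. \tag{3.16}
$$

Second, ME-Net and CS-Net are trained jointly, and the loss function can be
finally expressed as

$$
\mathcal{L} = \mathcal{L}_{cs} + \lambda \mathcal{L}_{me}, \tag{3.17}
$$

where $\lambda$ is a positive real value, and the loss $\mathcal{L}_{cs}$ is
defined as

$$
\mathcal{L}_{cs} = \alpha \mathcal{L}_{cs}^{RI} + (1 - \alpha) \mathcal{L}_{cs}^{Mag}, \tag{3.18}
$$

where

$$
\mathcal{L}_{cs}^{RI} = \left\| \hat{X}_r - X_r \right\|_F^2 + \left\| \hat{X}_i - X_i \right\|_F^2, \tag{3.19}
$$

$$
\mathcal{L}_{cs}^{Mag} = \left\| \sqrt{\hat{X}_r^2 + \hat{X}_i^2} - \sqrt{X_r^2 + X_i^2} \right\|_F^2. \tag{3.20}
$$

In this study, we set $\alpha = 0.5$ and $\lambda = 0.1$, the same as in [41],
for joint denoising and dereverberation.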
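The combined RI and magnitude loss used for both GCT-Net (eqns (3.9) to
(3.11)) and CS-Net (eqns (3.18) to (3.20)), together with the joint loss of
eqn (3.17), can be written compactly in NumPy. This is a sketch for checking
the formulas on arrays, not a training implementation: the squared norm is
taken here as a plain sum of squares, which is an assumption about the
normalization, and the function names are hypothetical.

```python
import numpy as np

def ri_mag_loss(est_r, est_i, ref_r, ref_i, alpha=0.5):
    """alpha * RI loss + (1 - alpha) * magnitude loss."""
    ri = np.sum((est_r - ref_r) ** 2) + np.sum((est_i - ref_i) ** 2)
    mag = np.sum((np.hypot(est_r, est_i) - np.hypot(ref_r, ref_i)) ** 2)
    return alpha * ri + (1.0 - alpha) * mag

def joint_loss(est_r, est_i, me_mag, ref_r, ref_i, alpha=0.5, lam=0.1):
    """Joint loss of the second training stage: L_cs + lam * L_me.

    me_mag is the coarse magnitude produced by the first-stage network.
    """
    l_cs = ri_mag_loss(est_r, est_i, ref_r, ref_i, alpha)
    l_me = np.sum((me_mag - np.hypot(ref_r, ref_i)) ** 2)
    return l_cs + lam * l_me
```

An estimate with the correct magnitude but inverted phase has zero magnitude
loss and a large RI loss, which shows why the two terms are combined: the RI
term penalizes phase errors that the magnitude term cannot see.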


3.5 Experimental Setup 
3.5.1 Datasets 


In this study, we conduct all the experiments on the DNS-Challenge clean
speech dataset [45], which is taken from the public audiobook dataset
LibriVox. In total, 45,000 and 3,000 utterances are randomly selected for
training and validation, respectively, and 150 utterances from other, unseen
speakers are selected for model evaluation. To simulate noisy environments,
we randomly select around 550 noise processes from the DNS-Challenge datasets
[45] as the noise set for training and validation.

To simulate indoor reverberant environments, three sets of reverberant
rooms are generated, namely a small room set, a medium room set, and a large
room set. Each room set contains 200 rooms of different sizes. Similar to
[46], the width and length of the rooms in the small, medium, and large room
sets are sampled uniformly from 1 to 10 m, 10 to 30 m, and 30 to 50 m,
respectively. The height of each room ranges from 3 to 5 m, and the room
absorption coefficient ranges from 0.2 to 0.8. In each room, a receiver
position is randomly sampled and then 100 RIRs are randomly generated
according to different speaker positions. The image method is used to
simulate the RIRs [47]. In addition, as small rooms have more complicated
acoustic environments and have received wide attention in the field of speech
dereverberation, extra RIRs of 28 small rooms are simulated according to
$T_{60}$ values uniformly sampled from 0.3 to 1.0 s. In total, we obtain
62,800 RIRs (3 × 200 × 100 + 28 × 100) for simulating training utterances.

To generate the training set, we consider speech communication scenarios
with one desired speaker and with multiple desired speakers, so two types of
datasets are prepared for both training and validation, namely a one-speaker
dataset and a multi-speaker dataset.

In the one-speaker dataset, each clean utterance for training and
validation is convolved with a random RIR to generate a reverberant
utterance. A randomly selected noise is then added at a randomly chosen SNR
ranging from 10 to 30 dB. As a result, we generate 45,000 and 3,000
utterances in the one-speaker dataset for training and validation, with
durations of about 100 and 7 hours, respectively.

The multi-speaker dataset consists of half one-speaker data and half
two-speaker data. The one-speaker half is generated in the same way as above
and has 22,500 and 1,500 utterances for training and validation,
respectively. To generate the two-speaker half, two reverberant speech
signals from the same room are randomly selected and mixed first, and a
randomly selected noise is then added at an SNR ranging from 10 to 30 dB. The
two-speaker half also has 22,500 and 1,500 utterances for training and
validation, respectively. Thus, around 100 and 7 hours of speech data are
generated for training and validation in both the one-speaker dataset and the
multi-speaker dataset.
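Adding noise at a prescribed SNR, as done when building the datasets above,
amounts to scaling the noise before summation. The sketch below is one common
way to do this and is not taken from the authors' code; it assumes that the
SNR is defined from the average power of the whole reverberant utterance and
of the whole noise segment, and the function name `mix_at_snr` is
hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db (in dB).

    speech, noise: 1-D arrays of equal length with non-zero power.
    Returns the mixture and the scale factor applied to the noise.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, scale

# Example: draw an SNR uniformly from the 10 to 30 dB range used in the text.
rng = np.random.default_rng(0)
snr_db = rng.uniform(10.0, 30.0)
mixture, scale = mix_at_snr(rng.standard_normal(16000),
                            rng.standard_normal(16000), snr_db)
```

For the two-speaker case, the same function can be applied with `speech` set
to the sum of the two reverberant signals.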


3.5.2 Parameter Configuration 


We resample all the utterances to 16 kHz and chunk them into 8-s segments.
A 32-ms Hanning window is applied before the STFT, with a frame shift of
8 ms, and a 512-point fast Fourier transform (FFT) is applied to extract the
spectral features, giving 257 frequency bins per frame. Each model is trained
for 50 epochs using the Adam optimizer [48]. For the one-stage GCT-Net and
the first stage of CTS-Net, the learning rate is set to 0.001 and is halved
when the validation loss increases for three consecutive epochs. In the
second stage of CTS-Net, ME-Net is fine-tuned with a learning rate of 0.0001,
and the learning rate of CS-Net is set to 0.001. We set the batch size to 16.
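The analysis settings above translate into a 512-sample window and a
128-sample hop at 16 kHz. The following NumPy sketch frames a signal
accordingly; it uses no padding or centering and NumPy's symmetric Hanning
window, both of which are assumptions since the chapter does not specify
them, and the function name `stft` is only illustrative.

```python
import numpy as np

def stft(signal, win_len=512, hop=128, n_fft=512):
    """Short-time Fourier transform with a Hanning window.

    At 16 kHz, win_len=512 is 32 ms and hop=128 is 8 ms.
    Returns a complex array of shape (n_frames, n_fft // 2 + 1).
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + win_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies only: 512 // 2 + 1 = 257 bins.
    return np.fft.rfft(frames, n=n_fft, axis=-1)

spec = stft(np.zeros(8 * 16000))
print(spec.shape)  # (997, 257)
```

The 257 frequency bins match the FeatureSize of the first encoder layer in
Table 3.1, and an 8-s chunk yields 997 frames under these framing
assumptions.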


3.6 Results and Analysis 


To evaluate and compare the performance of the one-stage and multi-stage
CNN models, we test GCT-Net and CTS-Net on a prepared test set. This
evaluation also aims to investigate the advantages and disadvantages of these
models in extracting one or multiple desired speech signals. There are seven
values of $T_{60}$ in this dataset, i.e., {0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2}
s, and for each $T_{60}$, 150 mixed speech signals and 150 pairs of RIRs of
unseen small room configurations are generated. Both seen and unseen noise
scenarios are considered when generating the test set, where the unseen
noises, namely babble and cafe, are taken from NOISEX-92 [49].

Similar to the training and validation sets, two types of datasets are
generated for the test set, namely a one-speaker test dataset and a
two-speaker test dataset. In the one-speaker test dataset, an RIR is randomly
chosen from the 150 generated RIRs and convolved with a clean utterance for
each $T_{60}$. The result is then mixed with a randomly chosen seen or unseen
noise at an SNR ranging from 10 to 30 dB. For the two-speaker test dataset, a
pair of reverberant utterances is generated by randomly selecting two clean
utterances and convolving them with a pair of RIRs from the same room. The
two reverberant utterances are mixed first, and the mixed utterance is then
mixed with a randomly chosen noise at an SNR ranging from 10 to 30 dB.
Therefore, there are 1050 utterances in total in both the one-speaker test
dataset and the two-speaker test dataset. All the evaluated models are
trained using PyTorch [50] on a single Nvidia Tesla V100.

In the following, we present speech spectrograms obtained with different
CNN models and different training strategies for different values of
$T_{60}$. We then choose the perceptual evaluation of speech quality (PESQ)
score [36], the extended short-time objective intelligibility (ESTOI) score
[37], and the signal-to-distortion ratio (SDR) [38] as objective measures of
perceptual speech quality, speech intelligibility, and speech distortion,
respectively.


3.6.1 Spectrograms 


To intuitively study the denoising and dereverberation performance of differ- 
ent models and different training strategies, we plot speech spectrograms of 
the enhanced speech signals with different models and training strategies in 
Figure 3.5 with the T60 value being equal to 0.6 s, where the mixed speech
contains only one desired speaker, and the noise is unseen. In Figure 3.5, (a) 
is the clean speech spectrogram, (b) is the noisy-reverberant speech spectro- 
gram, (c) is the spectrogram of the enhanced speech with GCT-Net trained 
using one-speaker training set, (d) is the enhanced speech spectrogram with 
GCT-Net trained using multi-speaker training set, (e) is the enhanced speech 
spectrogram with CTS-Net trained using one-speaker training set, and (f) is 
the enhanced speech spectrogram with CTS-Net trained using multi-speaker 
training set. 
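
For readers who wish to produce comparable plots, a log-magnitude spectrogram can be computed with a short-time Fourier transform, as in the sketch below. The STFT settings used for Figure 3.5 are not stated in this section, so the 16 kHz sampling rate, the 20 ms Hann window, and the 10 ms hop are assumptions, and the 1 kHz tone merely stands in for a speech signal.

```python
import numpy as np

def log_spectrogram(signal, frame_len=320, hop=160, eps=1e-8):
    """Return a (frames, bins) log-magnitude STFT in dB using a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + eps)

fs = 16000                            # assumed sampling rate in Hz
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone standing in for speech
spec = log_spectrogram(tone)          # rows are time frames, columns are frequency bins
```

Each frequency bin spans fs / frame_len = 50 Hz under these settings, so the tone's energy concentrates in bin 20.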

One can see from Figure 3.5 that both GCT-Net and CTS-Net have
impressive dereverberation performance. By comparing Figures 3.5(c) and
3.5(e), we see that the enhanced speech with CTS-Net has less reverberation
than that with GCT-Net. For example, the temporal smearing phenomenon in
Figure 3.5(e) is less obvious than that in Figure 3.5(c). Moreover, the
time-frequency structure of the enhanced speech with CTS-Net is clearer than
that with GCT-Net, e.g., in the time interval [0.9, 1.2] s. From Figures 3.5(e)
and 3.5(f), one can see that CTS-Net trained by the one-speaker training set
can suppress slightly more reverberant components than that trained by
the multi-speaker training set.

Figure 3.5 Speech spectrograms with the reverberation time T60 = 0.6 s containing only
one desired speaker of (a) clean speech, (b) noisy-reverberant speech (PESQ = 1.98), (c) the
enhanced speech with GCT-Net trained by one-speaker training set (PESQ = 2.50), (d) the
enhanced speech with GCT-Net trained by multi-speaker training set (PESQ = 2.48), (e) the
enhanced speech with CTS-Net trained by one-speaker training set (PESQ = 2.56), and (f)
the enhanced speech with CTS-Net trained by multi-speaker training set (PESQ = 2.54).

In Figure 3.5, there is only one desired speaker. In Figure 3.6, we
test different CNN models and different training strategies for the case of
two desired speakers, with T60 again set to 0.6 s. As in Figure 3.5, the
speech spectrograms of Figures 3.6(c) and 3.6(e) demonstrate that
CTS-Net outperforms GCT-Net in terms of denoising and dereverberation
because both desired speech signals can be better recovered. For example,
around 1 s in Figure 3.6(c), the spectral component of the weaker speech
signal is almost removed, but it is better reconstructed in Figure 3.6(e).
Moreover, by comparing Figures 3.6(e) and 3.6(f), one can observe that, in
the time interval [1.4, 1.5] s, the spectral harmonic structure in Figure 3.6(f)
is clearer, indicating that CNN models trained by the multi-speaker training
set perform better than those trained by the one-speaker training set,
especially when two desired speakers exist. To investigate the performance of
different CNN models as well as different training strategies more
comprehensively, objective measures are evaluated in the following.

Figure 3.6 Speech spectrograms with the reverberation time T60 = 0.6 s containing two
desired speakers of (a) clean speech, (b) noisy-reverberant speech (PESQ = 1.42), (c) the
enhanced speech with GCT-Net trained by one-speaker training set (PESQ = 1.64), (d) the
enhanced speech with GCT-Net trained by multi-speaker training set (PESQ = 1.96), (e) the
enhanced speech with CTS-Net trained by one-speaker training set (PESQ = 1.69), and (f)
the enhanced speech with CTS-Net trained by multi-speaker training set (PESQ = 2.08).
3.6.2 PESQ Scores


Table 3.2 gives the PESQ scores of different models in the seen noise test set.
One can see that the performance of CTS-Net is better than that of GCT-Net
in almost all situations, the exceptions occurring at T60 = 0 with multi-speaker
training. Thus, we can conclude that the two-stage training strategy
is more effective than the one-stage training strategy in improving
speech quality. By comparing the results of the one-speaker test set and
the two-speaker test set in Table 3.2, one can see that, if there is only one
desired speaker in the test set, CTS-Net trained with the one-speaker training
set can obtain slightly higher PESQ scores than that trained with the
multi-speaker training set. In contrast, GCT-Net trained with the multi-speaker
training set performs better than that trained with the one-speaker training set.


Table 3.2 The PESQ scores of different models in the seen noise test set

Metric: PESQ
         | One-speaker test set               | Two-speaker test set
         |      | One-speaker  | Mix-speaker  |      | One-speaker  | Mix-speaker
         |      | training set | training set |      | training set | training set
T60 (ms) | Mix  | GCT    CTS   | GCT    CTS   | Mix  | GCT    CTS   | GCT    CTS
1200     | 1.62 | 2.06   2.16  | 2.09   2.16  | 1.41 | 1.58   1.61  | 1.75   1.79
1000     | 1.71 | 2.17   2.28  | 2.20   2.28  | 1.50 | 1.69   1.71  | 1.87   1.91
800      | 1.80 | 2.30   2.41  | 2.33   2.40  | 1.60 | 1.80   1.83  | 2.00   2.04
600      | 1.94 | 2.49   2.60  | 2.52   2.58  | 1.74 | 1.98   2.04  | 2.21   2.25
400      | 2.17 | 2.77   2.89  | 2.80   2.87  | 1.99 | 2.24   2.31  | 2.47   2.52
200      | 2.70 | 3.25   3.34  | 3.28   3.33  | 2.58 | 2.80   2.84  | 3.05   3.09
0        | 3.12 | 3.69   3.72  | 3.72   3.70  | 3.38 | 3.72   3.72  | 3.84   3.81


Table 3.3 The PESQ scores of different models in the unseen noise test set.

Metric: PESQ
         | One-speaker test set               | Two-speaker test set
         |      | One-speaker  | Mix-speaker  |      | One-speaker  | Mix-speaker
         |      | training set | training set |      | training set | training set
T60 (ms) | Mix  | GCT    CTS   | GCT    CTS   | Mix  | GCT    CTS   | GCT    CTS
1200     | 1.63 | 2.00   2.10  | 2.02   2.08  | 1.41 | 1.54   1.56  | 1.70   1.74
1000     | 1.70 | 2.11   2.22  | 2.13   2.19  | 1.48 | 1.61   1.64  | 1.80   1.84
800      | 1.76 | 2.18   2.29  | 2.19   2.27  | 1.57 | 1.74   1.76  | 1.93   1.97
600      | 1.91 | 2.36   2.47  | 2.37   2.43  | 1.72 | 1.90   1.97  | 2.12   2.17
400      | 2.10 | 2.61   2.73  | 2.61   2.69  | 1.95 | 2.14   2.21  | 2.37   2.42
200      | 2.57 | 3.04   3.13  | 3.04   3.11  | 2.48 | 2.67   2.69  | 2.88   2.93
0        | 2.90 | 3.41   3.43  | 3.39   3.40  | 3.03 | 3.39   3.35  | 3.46   3.46
This is because GCT-Net and CTS-Net trained with the one-speaker set can
suppress more reverberation than those trained with the multi-speaker set, but
more speech distortion may be introduced. Furthermore, when there are two
desired speakers at the same time, the PESQ scores obtained by training with
the multi-speaker set are much higher than those obtained by training with the
one-speaker set, no matter which CNN model is used. An explanation for this
result is that, when two desired speakers co-exist, the speech distortion of
CTS-Net trained with the one-speaker set may heavily reduce PESQ scores.
Therefore, the CTS-Net that is trained by the multi-speaker set can obtain
higher speech quality than that trained by the one-speaker set.
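
The comparison drawn from Table 3.2 can be checked with a few lines of arithmetic. The sketch below averages the PESQ scores over the seven T60 values and reports the mean gain of each configuration over the unprocessed mixture. The numbers are copied from Table 3.2; the dictionary layout and helper names are illustrative choices, and averaging over T60 is an assumption made here for compactness, not an analysis performed in the chapter.

```python
# PESQ scores from Table 3.2 (seen noise), listed for T60 = 1200, 1000, ..., 0 ms.
PESQ_SEEN = {
    "one-speaker test": {
        "Mix": [1.62, 1.71, 1.80, 1.94, 2.17, 2.70, 3.12],
        "GCT, one-speaker training": [2.06, 2.17, 2.30, 2.49, 2.77, 3.25, 3.69],
        "CTS, one-speaker training": [2.16, 2.28, 2.41, 2.60, 2.89, 3.34, 3.72],
        "GCT, multi-speaker training": [2.09, 2.20, 2.33, 2.52, 2.80, 3.28, 3.72],
        "CTS, multi-speaker training": [2.16, 2.28, 2.40, 2.58, 2.87, 3.33, 3.70],
    },
    "two-speaker test": {
        "Mix": [1.41, 1.50, 1.60, 1.74, 1.99, 2.58, 3.38],
        "GCT, one-speaker training": [1.58, 1.69, 1.80, 1.98, 2.24, 2.80, 3.72],
        "CTS, one-speaker training": [1.61, 1.71, 1.83, 2.04, 2.31, 2.84, 3.72],
        "GCT, multi-speaker training": [1.75, 1.87, 2.00, 2.21, 2.47, 3.05, 3.84],
        "CTS, multi-speaker training": [1.79, 1.91, 2.04, 2.25, 2.52, 3.09, 3.81],
    },
}

def mean(values):
    return sum(values) / len(values)

def mean_gain_over_mix(block):
    """Mean PESQ improvement of each configuration relative to the unprocessed mixture."""
    base = mean(block["Mix"])
    return {name: mean(scores) - base
            for name, scores in block.items() if name != "Mix"}

gains = {test: mean_gain_over_mix(block) for test, block in PESQ_SEEN.items()}
best = {test: max(g, key=g.get) for test, g in gains.items()}

for test, g in gains.items():
    for name, gain in g.items():
        print(f"{test:18s} {name:28s} mean gain = {gain:.2f}")
```

Under this averaging, the configuration with the largest mean gain in each test set is the one whose training set matches the number of desired speakers, in line with the discussion above.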

We also test the PESQ metric in the unseen noise test set to verify the
generalization ability of different models; the results are given in Table 3.3.
The results in Table 3.3 are similar to those in Table 3.2. When the target
signal has only one desired speaker, CTS-Net trained by the one-speaker set is
the best one. This is because it can not only effectively suppress the noise
and the reverberation but also retain the detailed spectral structure of the
desired speech. Moreover, in the case that two desired speakers co-exist,
CTS-Net trained by the multi-speaker set can obtain the highest PESQ scores.
This is because it can retain the weaker speech components. Note that CTS-Net
performs better than GCT-Net in most cases.


3.6.3 ESTOI Scores


Table 3.4 summarizes ESTOI scores of different models in the seen noise 
test set. By comparing the ESTOI scores of GCT-Net and CTS-Net in all the 
situations, it can be seen that CTS-Net can obtain higher speech intelligibility 


Table 3.4 The ESTOI scores of different models in the seen noise test set

Metric: ESTOI (%)
         | One-speaker test set                | Two-speaker test set
         |       | One-speaker  | Mix-speaker  |       | One-speaker  | Mix-speaker
         |       | training set | training set |       | training set | training set
T60 (ms) | Mix   | GCT    CTS   | GCT    CTS   | Mix   | GCT    CTS   | GCT    CTS
1200     | 30.11 | 55.95  58.93 | 55.84  58.20 | 28.47 | 45.63  46.83 | 47.70  49.27
1000     | 33.24 | 59.61  62.58 | 59.62  62.16 | 31.98 | 49.28  50.73 | 51.76  53.61
800      | 39.07 | 63.97  66.90 | 64.19  66.60 | 37.87 | 54.12  55.29 | 56.91  58.63
600      | 46.92 | 69.12  71.69 | 69.03  71.44 | 45.33 | 59.60  61.07 | 62.91  64.74
400      | 58.56 | 76.57  78.93 | 76.55  78.64 | 56.65 | 66.66  67.89 | 70.40  71.91
200      | 77.93 | 86.47  88.14 | 86.45  87.92 | 76.12 | 78.56  79.33 | 82.35  83.58
0        | 93.49 | 96.28  96.45 | 96.20  96.47 | 95.96 | 95.54  95.66 | 97.30  97.49
Table 3.5 The ESTOI scores of different models in the unseen noise test set

Metric: ESTOI (%)
         | One-speaker test set                | Two-speaker test set
         |       | One-speaker  | Mix-speaker  |       | One-speaker  | Mix-speaker
         |       | training set | training set |       | training set | training set
T60 (ms) | Mix   | GCT    CTS   | GCT    CTS   | Mix   | GCT    CTS   | GCT    CTS
1200     | 30.11 | 54.19  56.91 | 53.69  55.92 | 27.87 | 44.01  45.13 | 46.12  47.62
1000     | 34.24 | 58.87  61.81 | 58.55  61.01 | 31.07 | 47.19  48.62 | 49.59  51.35
800      | 37.56 | 61.38  64.20 | 61.29  63.60 | 36.79 | 52.37  53.41 | 55.03  56.65
600      | 45.34 | 66.76  69.38 | 66.70  68.98 | 44.37 | 57.62  59.02 | 61.04  62.78
400      | 55.86 | 73.68  76.15 | 73.58  75.64 | 55.09 | 64.52  65.67 | 68.36  69.94
200      | 74.27 | 83.59  85.39 | 83.44  85.04 | 73.58 | 76.35  76.92 | 80.24  81.37
0        | 89.42 | 93.57  93.66 | 93.42  93.53 | 91.99 | 92.21  91.94 | 94.45  94.60


improvement than GCT-Net, verifying the advantage of two-stage CNN
models. Besides, when there is only one desired speaker and the CNN model
to be tested is GCT-Net, the ESTOI scores of the two training sets are
comparable. As for CTS-Net, the ESTOI scores of the one-speaker training set
are higher than those of the multi-speaker training set, especially when T60
is relatively large. In the case that the number of desired speakers is two,
both GCT-Net and CTS-Net perform better if trained using the multi-speaker
training set. By focusing on the results of CTS-Net, one can see that the
ESTOI score is higher when the number of desired speakers in the test set
matches that in the training set, and this may be due to the overfitting
problem of deep learning.
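
The matched-versus-mismatched observation for CTS-Net can be verified directly from Table 3.4. The sketch below subtracts, for each T60, the ESTOI score of the mismatched training set from that of the matched one. The values are copied from Table 3.4; the key layout and the function name are illustrative.

```python
T60_MS = [1200, 1000, 800, 600, 400, 200, 0]

# CTS-Net ESTOI (%) from Table 3.4 (seen noise), keyed by (test set, training set).
CTS_ESTOI = {
    ("one", "one"):   [58.93, 62.58, 66.90, 71.69, 78.93, 88.14, 96.45],
    ("one", "multi"): [58.20, 62.16, 66.60, 71.44, 78.64, 87.92, 96.47],
    ("two", "one"):   [46.83, 50.73, 55.29, 61.07, 67.89, 79.33, 95.66],
    ("two", "multi"): [49.27, 53.61, 58.63, 64.74, 71.91, 83.58, 97.49],
}

def matched_minus_mismatched(test_set):
    """ESTOI difference (matched training minus mismatched training) per T60."""
    matched = "one" if test_set == "one" else "multi"
    mismatched = "multi" if matched == "one" else "one"
    return [round(m - x, 2)
            for m, x in zip(CTS_ESTOI[(test_set, matched)],
                            CTS_ESTOI[(test_set, mismatched)])]

for test_set in ("one", "two"):
    diffs = matched_minus_mismatched(test_set)
    wins = sum(d > 0 for d in diffs)
    print(f"{test_set}-speaker test: matched training wins {wins}/{len(T60_MS)}, "
          f"mean difference = {sum(diffs) / len(diffs):.2f} points")
```

In the one-speaker test set the advantage of matched training is a fraction of a point and vanishes at T60 = 0, whereas in the two-speaker test set it amounts to several points at every T60.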

Table 3.5 presents the ESTOI scores of different CNN models and different
training strategies in the unseen noise test set. It can be observed
that CTS-Net is better than GCT-Net in improving speech intelligibility.
Besides, when the number of desired speakers in the test set matches that in
the training set, we get the highest ESTOI scores for both the one-speaker
and two-speaker scenarios. The results are similar to those in
Table 3.4, and the main reasons are the same as those given for PESQ.


3.6.4 SDR 


Table 3.6 gives the SDR results in the seen noise test set. It can be seen from
Table 3.6 that CNN models trained with the multi-speaker set can obtain higher
SDR values in most cases. Besides, when training with the multi-speaker set and
T60 is relatively short, e.g., less than 0.6 s, the SDR values of CTS-Net
are higher than those of GCT-Net, indicating that GCT-Net may cause more
speech distortion. As T60 increases, the SDR values of GCT-Net become
Table 3.6 The SDR values of different models in the seen noise test set

Metric: SDR (dB)
         | One-speaker test set                | Two-speaker test set
         |       | One-speaker  | Mix-speaker  |       | One-speaker  | Mix-speaker
         |       | training set | training set |       | training set | training set
T60 (ms) | Mix   |   GCT    CTS |   GCT    CTS | Mix   |   GCT    CTS |   GCT    CTS
1200     | -2.96 | -0.25  -0.35 |  0.13  -0.03 | -5.39 | -3.04  -3.99 | -3.23  -3.28
1000     | -2.05 |  0.52   0.55 |  0.99   0.86 | -4.73 | -3.41  -3.51 | -2.69  -2.85
800      |  1.80 |  2.30   2.41 |  2.33   2.40 | -3.46 | -2.66  -2.79 | -1.88  -1.92
600      |  1.30 |  2.17   2.27 |  2.61   2.53 | -1.69 | -1.42  -1.48 | -0.53  -0.54
400      |  4.78 |  4.69   5.22 |  4.95   5.44 |  0.86 |  0.20   0.20 |  1.26   1.30
200      | 13.14 |  9.93  10.96 | 10.06  11.07 |  5.57 |  3.89   3.69 |  5.23   5.30
0        | 18.98 | 23.42  24.13 | 23.31  24.34 | 20.47 | 19.34  19.14 | 23.31  24.25


Table 3.7 The SDR values of different models in the unseen noise test set

Metric: SDR (dB)
         | One-speaker test set                | Two-speaker test set
         |       | One-speaker  | Mix-speaker  |       | One-speaker  | Mix-speaker
         |       | training set | training set |       | training set | training set
T60 (ms) | Mix   |   GCT    CTS |   GCT    CTS | Mix   |   GCT    CTS |   GCT    CTS
1200     | -2.66 | -0.16  -0.07 |  0.18   0.17 | -5.39 | -4.01  -4.05 | -3.34  -3.37
1000     | -1.63 |  0.84   0.92 |  1.16   1.07 | -4.72 | -3.49  -3.48 | -2.80  -2.90
800      | -0.65 |  1.30   1.51 |  1.73   1.73 | -3.45 | -2.75  -2.85 | -1.98  -2.01
600      |  1.41 |  2.44   2.51 |  2.72   2.72 | -1.68 | -1.46  -1.50 | -0.60  -0.61
400      |  4.54 |  4.69   5.12 |  4.92   5.25 |  0.88 |  0.10   0.09 |  1.17   1.21
200      | 12.72 |  9.47  10.15 |  9.51  10.26 |  5.56 |  3.71   3.43 |  5.06   5.13
0        | 19.59 | 21.39  21.80 | 21.18  21.81 | 19.65 | 16.82  16.38 | 20.38  20.87


higher than those of CTS-Net. This is because the residual reverberation 
component of GCT-Net is less than that of CTS-Net. 

In Table 3.7, the SDR values of different CNN models that are trained by
different sets in the unseen noise set are given. It can be seen from Table 3.7
that CNN models trained by the multi-speaker training set can obtain higher
SDR values than those trained by the one-speaker training set, indicating that
the former cause less speech distortion. Moreover, CTS-Net gets higher
SDR values than GCT-Net when T60 is smaller than 0.6 s, but the results are
reversed when T60 is not smaller than 0.6 s. It can be found that this is
similar to the results in Table 3.6.
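
To make the quantity in Tables 3.6 and 3.7 concrete, the sketch below computes a simplified SDR as the ratio, in dB, of the reference energy to the energy of the estimation error. This is an assumption made for illustration: the exact definition used in [38] may differ (for instance, by projecting the estimate onto the reference before measuring the distortion), and the function name is hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB.

    Everything in `estimate` that differs from `reference` is treated as distortion.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    distortion = estimate - reference
    return 10.0 * np.log10((np.sum(reference ** 2) + eps)
                           / (np.sum(distortion ** 2) + eps))

reference = np.ones(100)
estimate = 1.1 * reference   # a 10% amplitude error on every sample
sdr_db = simple_sdr(reference, estimate)
```

A 10% amplitude error gives a distortion energy of 1% of the reference energy, i.e., an SDR of about 20 dB; negative values, as in the rows for large T60, mean that the distortion energy exceeds the reference energy.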

To sum up, we present the test results of PESQ scores, ESTOI
scores, and SDR values for different CNN models as well as different training
strategies in both seen and unseen noise sets. From these quantitative results,
we can see that CTS-Net has better denoising and dereverberation performance
than GCT-Net in most cases, validating the two-stage
CNN models. Besides, the CNN models trained by the one-speaker training set
are more suitable than those trained by the multi-speaker training set when
there is only one desired speaker, as they lead to higher speech quality as
well as higher speech intelligibility improvement. On the contrary, when two
desired speakers co-exist, the CNN models trained by the multi-speaker training
set perform better. Because the differences in the PESQ and ESTOI scores are
more significant in the latter case, we can conclude that the CNN models
trained by the multi-speaker training set are more robust. Moreover, the
results of all the objective metrics do not differ remarkably between the seen
and unseen noise scenarios; this validates the generalization ability
of GCT-Net and CTS-Net when they are trained with numerous noise sets.


3.6.5 Subjective Listening Test 


In this section, we utilize AB listening tests for the subjective evaluation,
whose procedure is similar to that in [41]. The experiment is conducted at the
Institute of Acoustics, Chinese Academy of Sciences, with 12 audiologically
normal-hearing subjects participating in the listening test. They
are graduate students or teachers, whose ages range from 25 to 35. In
the listening test, two groups of speech signals are compared,
namely the enhanced speech of GCT-Net versus that of CTS-Net, and CTS-Net
trained by the one-speaker set versus that trained by the multi-speaker set.
Twenty pairs of utterances, randomly chosen from the test set, are tested in
each group. All the speech signals are played by a PC soundcard
and reproduced with circumaural headphones (Sennheiser HD 380 Pro). Note
that the order of the methods to be compared is shuffled. Before the listening
test, all the participants are told to compare the speech quality and speech
intelligibility of each pair. When it comes to the second group, they are told
to focus on the continuity and naturalness of the two speech signals. During
the listening test, the original noisy speech and the two processed speech
signals are sequentially presented to each participant. Then, they are asked to
select their preferred utterance. The “Equal” option is provided in case they
cannot distinguish the difference between the two speech signals.
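
The preference percentages reported for an AB test of this kind are simple vote shares. The sketch below tallies them for one comparison group; the vote counts are invented purely to illustrate the arithmetic (12 listeners times 20 pairs gives 240 judgments) and are not the results of this study, and the function name is hypothetical.

```python
from collections import Counter

def preference_percentages(votes, options=("A", "B", "Equal")):
    """Return the share of votes (in %) received by each option."""
    counts = Counter(votes)
    total = len(votes)
    return {option: 100.0 * counts[option] / total for option in options}

# Hypothetical votes for one group: 12 listeners x 20 pairs = 240 judgments.
votes = ["A"] * 150 + ["B"] * 60 + ["Equal"] * 30
result = preference_percentages(votes)
print(result)
```

Because the presentation order of the two methods is shuffled, each vote would first have to be mapped back from its playback position to the method it refers to before tallying.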

Three groups of subjective evaluation results are shown in Figure 3.7.
Group (a) shows the preference percentages of CTS-Net and GCT-Net in the one
desired speaker case. It can be seen from group (a) that
the preference percentage of CTS-Net is far larger than that of GCT-Net.
[Figure 3.7 plot: panels (a), (b), and (c); group labels “one-speaker test dataset” and “multi-speaker test dataset”; axis “Preference (%)”.]

Figure 3.7 Subjective comparison with AB listening test between (a) CTS-Net and GCT- 
Net trained by one-speaker set in only one desired speaker scenario, (b) CTS-Net trained by 
one-speaker set and that trained by multi-speaker set in only one desired speaker scenario, and 
(c) CTS-Net trained by one-speaker set and that trained by multi-speaker set in two desired 
speakers scenario. 


This result indicates that CTS-Net has better denoising and dereverberation
performance than GCT-Net, namely the two-stage CNN model outperforms
the one-stage CNN model for speech enhancement. Group (b) compares
the preference percentages of the CTS-Net trained by the one-speaker set and
by the multi-speaker set. Interestingly, one can see that although the CTS-Net
trained by the one-speaker set gets higher PESQ and ESTOI scores than that
trained by the multi-speaker set in the one desired speaker case, as shown in
Tables 3.3 and 3.5, the preference percentage of the latter training strategy
is larger than that of the former one. Group (c) shows the result when
two desired speakers co-exist, where the preference percentage of CTS-Net
trained by the multi-speaker set is larger than that of CTS-Net trained by the
one-speaker set. This result is in accordance with the PESQ and ESTOI scores
in the objective evaluation above. By jointly observing the preference
percentages of Group (b) and Group (c), we can conclude that the enhanced
speech of CNN models trained by the multi-speaker set is preferred by human
listeners.


3.7 Discussions and Conclusions 


In this chapter, monaural speech enhancement algorithms with deep learning
networks are studied, which aim to denoise and dereverberate simultaneously.
Two scenarios are considered, in which a single speaker or multiple
desired speakers need to be extracted. First, we described
the architectures and mechanisms of the proposed one-stage and two-stage
CNN-based models, namely GCT-Net and CTS-Net. Then, because multiple
desired speakers are considered, two types of training sets were generated,
namely the one-speaker set and the multi-speaker set, resulting in two
types of training strategies. Finally, numerous experiments were conducted to
show the advantages and disadvantages of GCT-Net and CTS-Net in different
scenarios.

The experiments reveal that CTS-Net shows better denoising and dereverberation
performance than GCT-Net. On the one hand, CTS-Net can not
only suppress more noise and reverberation but also recover more speech
components than GCT-Net, which can be observed in the spectrograms of the
enhanced speech signals. On the other hand, the values of different metrics in
the objective evaluation, including PESQ scores, ESTOI scores, and SDR values,
as well as the preference percentages in the subjective evaluation, indicate
that CTS-Net outperforms GCT-Net in most cases.

Moreover, CNN models trained using different training sets have advantages
in different aspects. It can be seen that GCT-Net or CTS-Net trained
by the one-speaker set can achieve more denoising and dereverberation
than that trained by the multi-speaker set; in contrast, the two models trained
by the multi-speaker set can keep the speech distortion of the desired speakers
as low as possible. Consequently, when there is only one desired speaker,
CNN models trained by the one-speaker set show slightly better performance
than those trained by the multi-speaker set, but when the number of desired
speakers increases, the former training strategy may cause significant speech
distortion. By contrast, the training strategy that uses the multi-speaker set
shows moderate performance in the one-speaker scenario and much better
performance in the two-speaker scenario, illustrating that this training
strategy is more robust in multiple-speaker scenarios.
Abstract


Emotions are vital in the mental life of humans. There are
a few universal emotions that any intelligent system with finite processing
resources can be trained to recognize or synthesize as needed, including
neutral, anger, happiness, and sadness. In the modern era, workers
may lose interest or concentration in their work, and this affects
the productivity of the industries concerned; it is therefore necessary
to monitor employees' emotional states with the help of IoT
technology for a smart industrial environment. Furthermore, despite
the rapid rise of IoT in the modern industrial era,
current IoT-based systems notably lack cognitive intelligence for industrial
use, implying that they are unable to meet the needs of industrial services.
Deep learning has become one of the most widely used approaches in various
diagnosis and prediction applications and studies. While it is mostly used for
content-based image retrieval, it can still be improved by using it in
various computer vision applications. Based on an analysis of the various
conventional approaches, new machine learning algorithms
such as deep learning or CNN-based approaches are needed to address the existing
issue by predicting employees' emotions. The proposed
research work mentors and counsels workers by monitoring their
behavior in the workplace. The goal of this research was to develop a CNN
model and controller area network (CAN) protocol-based emotional intelligence
system (EIS) to automatically categorize expressions in the Facial Expression
Recognition (FER2013) and Kaggle databases. The proposed CNN-based EIS
prediction system achieves a better efficiency of 96.75% within a time
duration of 36.65 seconds. The proposed system produces better performance
results compared to the existing support vector machine, decision tree, and
artificial neural network algorithms.


Keywords: Facial expression recognition, emotional intelligence, deep learn- 
ing, emotional intelligence system, convolution neural networks. 


4.1 Introduction 


Data mining refers to computer systems that are modeled after the human
brain and are capable of natural language processing, learning from the
past, cooperating naturally with humans, and assisting in decision-making.
At the turn of the twenty-first century, researchers developed computers that
ran at a faster rate than the human brain, reviving the data mining
approach (Kalimuthu Sivanantham et al., 2021). Furthermore, academics
have begun to use the phrase data mining prediction, which combines
technology and biology in an attempt at reverse engineering. The brain
is the most efficient and effective computer on the planet (Toneva and
Leila, 2019). The following sections explain the relationship between edge
computing and CNN classification techniques.

Figure 4.1 shows the general architecture diagram for IoT data transferred
using the CAN protocol and edge-computing-based classification. The CAN
protocol architecture layer and edge computing handle analytics, storage,
quality of service, and sensing connectivity; finally, the four
stages of CNN deep learning techniques are shown. Internet of Things
(IoT) data prediction refers to the integration of cognition into IoT, and all
the characteristics of IoT apply to the cognitive Internet of Things as well
(Qihui et al., 2014). As a result, it is assumed that the terms “IoT data”
and “security” are used in the same way in the architecture diagram. Although
Kevin Ashton, a digital innovation expert, is credited with the initial use of
the phrase IoT, other groups later defined the term “IoT” based on the
widespread belief that the Internet's original version was about data provided
by people, while the next version is about data created by things. Earlier
definitions of the “Internet of Things” were non-trivial to formulate, taking
into account the extensive experience and required technologies, ranging from
sensing objects, data aggregation and pre-processing, and communication
systems to object instantiation and finally service provision. However, being
a worldwide notion, it necessitates a standard definition (Hiller et al., 2018).

Figure 4.1 General architecture diagram for IoT data transferred using CAN protocol and
edge-computing-based classification.


4.1.1 Internet of Things (IoT)


The Internet of Things (IoT) is the next wave of innovation, stemming
from the convergence of the Internet and smart gadgets. Smart things are objects
equipped with appropriate sensors and actuators, as well as communication
technologies such as RFID or NFC. IoT will produce and use data to help
people in their daily lives, for example by saving cost and time and by
improving optimization in any field. An innovative framework for fostering
collaboration to operate the Internet of Things is still lacking. Smart things,
unlike computers, do not have a user to configure them for various conditions;
instead, they must adapt themselves to their environment and respond correctly
to events that occur around them (Nalbandian, 2015). They require the right
sensors and actuators to make the best prediction possible in response to what
is going on around them. The Internet of Things (IoT) is a diverse,
heterogeneous, and ubiquitous network that has been widely explored in the
domain of recent intelligent or smart services, and it offers a lot of
potential and many scenarios for modern intelligent service applications.
Furthermore, there are numerous machine learning applications, ranging from
web page ranking to collaborative filtering to image or speech recognition
(Liang et al., 2018).


4.1.2 Emotional Classification 


Facial or emotional expressions are vital cues for non-verbal communication
and social interaction among human beings. This is possible
because people are able to identify moods quite accurately and
efficiently. Thus, an automatic facial mood or emotion recognition model
is a vital module in the interaction of humans with machines. Beyond
the commercial uses of such a system, it would be advantageous to integrate
some cues from the biological neural system. In the proposed model, these are
used to gain additional insight into the cognitive or intellectual
processing capability of the human brain (Javier et al., 2013).

One of the hallmarks of emotional intelligence, a component of human
intelligence that has been considered to be even more essential than
mathematical and verbal intelligence, is the ability to perceive emotion.
Emotion identification, according to researchers, is a step toward machine
emotional intelligence (Scherer and Ursula, 2011). Emotion can be recognized
through a variety of channels, including facial expressions, speech, and
writing. When it comes to emotional intelligence (E.I.), there are several
factors to consider. In this research, we associate it with feelings and
expressions of feelings. Emotional intelligence is a type of understanding
that encompasses the ability to perceive and influence emotions (Claudia,
2007). It comprises the capacity to recognize and convey a feeling, assimilate
it into thought, understand and reason with it, and regulate it in oneself and
others. According to some authors, emotional
intelligence is linked to the observation and processing of emotions because
humans think and behave based on their life experiences, which are prompted
by current or past situations (Sivanantham, 2021).

Machine learning is making significant strides in every industry, including
medicine, oil and gas, education, energy, weather forecasting, and the stock
market, to name a few. As these developments show, machine learning is
affecting not only technology but also the lives of ordinary people (Shen and
Tongda, 2012). Because of machine learning, an ordinary person has become
a tech-aware individual. However, existing IoT lacks
intelligence, implying that it cannot meet industrial service application
requirements. Furthermore, IoT is currently based on out-of-date static
models and architectures that lack recent intelligent services such
as mood recognition and are unable to meet the increasing performance
demands of companies. It is possible to develop a unique notion of IoT by
incorporating emotional intelligence concepts into IoT (Fatima et al., 2020).
With cooperative strategies for enhancing performance, a modern
intelligent IoT can detect current network circumstances, assess perceived
knowledge, make intelligent judgments, and take adaptive actions to improve
network performance. The Internet of Things (IoT) is a global network
architecture made up of countless connected objects that rely on sensing,
communication, networking, and data processing technologies (Francesco
et al., 2019). The basic universal expressions proposed by Ekman et al. (1976)
are listed as follows:


• Happiness
• Anger
• Sadness
• Surprise
• Disgust
• Fear

Some of the non-basic expressions are as follows:

• Irritation
• Despair
• Boredom
• Panic
• Shame
• Excitement
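
In software, the two lists above are typically represented as a fixed set of class labels. The sketch below is one possible encoding; the class name `BasicExpression`, the constant `NON_BASIC`, and the helper `is_basic` are hypothetical names chosen for illustration and do not come from the proposed system.

```python
from enum import Enum

class BasicExpression(Enum):
    """The six basic universal expressions listed above."""
    HAPPINESS = "happiness"
    ANGER = "anger"
    SADNESS = "sadness"
    SURPRISE = "surprise"
    DISGUST = "disgust"
    FEAR = "fear"

# Non-basic expressions from the second list.
NON_BASIC = {"irritation", "despair", "boredom", "panic", "shame", "excitement"}

def is_basic(label: str) -> bool:
    """Return True if `label` names one of the six basic expressions."""
    return label.strip().lower() in {e.value for e in BasicExpression}

print(is_basic("Fear"), is_basic("boredom"))
```

A classifier restricted to the basic expressions would reject or remap any label found in `NON_BASIC`.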
Novel paradigms based on IoT applications are emerging in a variety
of industrial applications. However, due to heterogeneous and uncertain
ubiquitous networks, which have a wide range of applications in the field
of modern intelligent services, and discrepancies between service offerings
and application demand, today’s industrial systems face many issues.
Accordingly, researchers feel that current technologies and approaches,
particularly the Internet of Things (IoT), are lacking in cognitive
intelligence and hence cannot deliver the predicted upgrades and smart
industry advancements (Zhaozong et al., 2016). As a result, the primary goal
of this article is to study notions of IoT-based settings as well as to review
technologies, platforms, and developments in cognitive and smart industrial
systems. Following that, we examine the research problems and open topics to
improve knowledge accumulation in the field of the Internet of Things (IoT)
(Da and Shancang, 2014). The driving force behind these efforts is to develop
an emotional intelligence system capable of recognizing and regulating people's
facial expressions and emotions during social interactions. In the
era of real-time decision making, this model will help industry professionals
track and manage their customers' and workers' real-time feelings and
behaviors (Catherine et al., 2020). The technologies at the core of this impact
are machine learning (ML) and deep learning. Like the industrial revolution,
these agents of change alter the environment in which human beings live and
communicate. Though still emerging, they are entering our daily lives through
healthcare advancements, power grid creation, agricultural yield
improvements, smartphone technology, and climate change monitoring. ML
algorithms are used to build a mathematical model based on training samples
in order to make decisions about future test samples. ML is a subset of
artificial intelligence and is used in various computer vision tasks (El Naq,
Issam, and Murphy, 2015).


4.1.3 Applications 


Emotional prediction can be applied to improve manpower utilization and to
reduce the cost of producing goods. Because emotion is a biologically innate
ability and a part of our evolutionary history, we can all read emotions, and
this ability improves with the experience of everyday life. Because of the
prominence of the human-computer interface (HCI) nowadays, there is a need
for machines to understand an individual's facial visual cues. By
understanding the cues of a human, a robot can perform its various tasks with
greater value.
This proposed work can be used to evaluate and respond appropriately to human
feelings by providing empathic responses in the field of emotion handling,
which is an important aspect of people's well-being. Application areas
include:


• Preventive medicine
• Remote monitoring of patient's physiology
• E-learning
• Social monitoring
• Online games


In the domains of computer analysis, lie detection, airport security, non-verbal 
communication, and even the importance of expressions in art, the study and 
comprehension of human facial expression have various applications. 

Recognizing a person's expression can help in a variety of fields, including
medical science, where a doctor can be notified if a patient is in extreme
agony while asleep (Rhawn, 1988). It aids in taking quick action at the
appropriate time. Teachers can tell a lot about a student's learning status 
by looking at their facial expressions. As a result, vision-based expres- 
sion analysis is also useful in e-learning (Maryam and Montazer, 2019). 
A computer vision system can be used to automatically assess learners 
in remote education to discern non-verbal facial expressions and determine
their learning state (Chunfeng, 2016). It is beneficial to both teachers and 
students because it can aid in the improvement of the teaching and learning 
process. Facial reactions reveal a lot about a person’s reaction to a stimulus. 
Facial coding has been a common method in market research and media 
measurement in recent years because it may unobtrusively record information
about an individual’s response at the moment. We can acquire self-reports 
with quantifiable assessments of more unconscious emotional responses to 
a product or service by tracking people’s facial expressions when they give 
comments about it. Market segmentation may be examined and target audi- 
ences can be determined using facial expression analysis to optimize items on 
the market. 

The remainder of this chapter is organized as follows. Section 4.2 discusses
emotional prediction using facial expression recognition and related work.
The proposed system design, including the controller area network (CAN), is
discussed in Section 4.3. Section 4.4 compares the experimental outcomes of
the proposed and existing systems for IoT emotional data using convolutional
neural networks. Finally, Section 4.5 concludes with some thoughts on the
proposed edge computing and CAN for IoT data classification utilizing
convolutional neural networks, as well as its future scope.
4.2 Literature Review


Smart manufacturing has emerged as a vital engine of research, innovation, 
productivity, and export growth in recent years, because of rapid technologi- 
cal advancements, particularly in cognitive sensor technology. The goal is to 
achieve a new level of productivity, security, safety, and optimizations, as well 
as the transformation of data into insightful and timely information, allowing 
decision-makers across the enterprise to gain new visibility into operations, 
improve their ability to respond to market and business challenges, and 
eliminate operational inefficiencies (Alejandra et al., 2019). 

Shen et al. (2017) proposed a bidirectional LSTM and convolutional neural
network (BLSTM-CNN) technique that learns the dynamic appearance and shape of
facial regions in order to detect facial action units. Models for joint
visual-textual sentiment analysis of social multimedia were designed using a
cross-modality consistent regression method that fine-tunes the CNN model.
The authors also studied a method for constructing large-scale datasets for
image emotion recognition with deep convolutional neural network models. A
similar model for neuromorphic affective computing using a deep CNN was
designed. A facial mood recognizer using a deep softmax regression technique,
referred to as MNIST, together with a 2D convolutional neural network was
also developed by Chang et al. (2018).

Using the deep CNN technique, a model for emotion recognition on small
datasets was established. Its outcomes showed that a cascaded fine-tuning
technique achieved better results than single-stage fine-tuning. Furthermore,
a model that achieved higher accuracy for subject-independent emotion
recognition from facial expressions via a combined CNN and DBN was developed.
An end-to-end speech emotion identification system was proposed by
integrating CNN and LSTM networks. The authors proposed a platform for
recognizing emotions in datasets or video using a hybrid CNN-RNN model. An
emotion identification model for use in the wild was also proposed using a
CNN and mapped binary patterns (Meiyin and Chen, 2015).

Automated liquid flow measurement in industry is one application that can
employ IoT networks to improve performance and regularity across
liquid-handling industrial areas, improving the quality of process estimates
and limiting negative impacts through imperative control techniques. An IoT
design for industrial applications is demonstrated in this chapter. The
proposed RSS-based security system, which is built on a cloud system and
background analysis, collects and sends the essential information. The
RSS-based control program uses the cloud to apply the information security
algorithm in the presence of current communication problems. Confirmation,
information security, and integrity are addressed together to obtain the
highest level of assurance in the cloud. A performance analysis of flow rate
measurement is depicted in Table 4.1. Compared with conventional methods such
as fuzzy and PID control, the proposed method delivers favorable results: the
imperative, fuzzy, and PID controllers have SSE values of 0.048, 0.039, and
0.049 (seconds) and settling times of 21, 26, and 32 (seconds), respectively,
as explained in Kanagaraju and Nallusamy (2019).

As a result, a deep CNN model can be used for deep learning based estimation
of facial action unit occurrence and intensity. The CNN model was also used
to construct a strategy for identifying the semantics of facial features and
the extreme sharing native dissimilarity stabilization approach. A multimodal
emotional state recognition system was built with a 91.3% accuracy rate
(Latha and Mohana, 2016). For imbalanced multimedia data categorization, a
CNN model was combined with a bootstrapping strategy. The authors created
custom hardware that reduced the time needed to train neural networks and
CNNs, and applied it to video analytics in smart cities. Sequential deep
learning for human action recognition was realized using a 3D CNN, with
recognition accuracy in the range 92.17%-94.39%. A CNN model was likewise
developed for EEG-based emotion categorization (Zhang and Dongrui, 2019). The
literature on convolutional neural network technology was reviewed with
respect to the deep learning technique employed, contributions,
classification accuracy attained, and limitations.

Approaches based on 2D images are constrained by various factors such as
imaging conditions and pose. Most research in emotion recognition depends
heavily on facial landmarks such as the eyebrows, mouth, and nose tip.
Locating the landmarks that contribute most to an expression is a tedious
task because of the complexity of the face, and it needs human intervention
for better accuracy, which results in more processing time. A neutral scan of
the face is used as a reference for model fitting, or to find the
displacements of the landmarks that identify the expressions (Xiaoguang and
Jain, 2008).

Many manufacturing systems rely on technologies that are converging to
support and enable IoT applications in the direction of cognitive
manufacturing systems (MS) as a growing sector. IoT architecture, network
technology, software and algorithms, security, trust, reliability, privacy,
interoperability, and standardization are only a few of the technologies
covered (Asma et al., 2016). However, its full significance cannot yet be
realized because of many impediments, and the issues that manufacturing faces
now are not the same as those it faced previously. For establishing cognitive
or smart manufacturing systems, bridging the gap between approaches from many
disciplines is a critical challenge. Manufacturers, too, seek practical
assistance, yet the majority of academic research is unrelated to their
needs. Academics investigate technological frontiers such as artificial
intelligence and deep learning without contemplating how they will be used in
the future (Fotis, 2020). Manufacturers want to know what kind of data to
collect, which sensors to use, and where to place them on the production
line. As a result, studies are required to find the optimal sensor setups.
Open concerns in smart manufacturing innovation include, for example,
adopting strategies, improving data collection, utilization, and sharing,
designing predictive models, studying generic predictive models, connecting
factories, and controlling processes. Because of these characteristics,
industrial production applications differ from lightweight, centrally
monitored CIoT-based applications (Long et al., 2019).

Motivated by the above, the proposed work aims to embed emotional
intelligence with machine intelligence in industrial environments in order to
assess the essential working conditions and behavior of employees and to make
better decisions about the correlation between emotional states and
performance. In addition, a novel approach is introduced to create a variety
of policies for tracking emotion and monitoring the behavior of employees in
industrial environments. However, as described in the introduction, the
proposed work also faces difficult issues that are unique to obtaining
trustworthy emotional data and to collecting a large set of image data from
employees trying to elicit and experience each of six emotional states. The
aim is to open prospects for the manufacturing, healthcare, and automotive
industries to create a set of use cases based on an integration of cognitive
technologies, cognitive architectures, and models, using employee emotion
prediction to help improve their products and performance.


4.3 System Design 


Facial or emotional expressions are vital cues for non-verbal communication
and social interaction among human beings. This is possible because people
are able to identify moods quite accurately and efficiently (Giardini and
Michael, 2008). Thus, an automatic facial mood or emotion recognition model
is a vital module in the interaction of humans with machines. Rather than
commercializing such a system, it would be more beneficial to incorporate
certain cues from the biological neural system into our model and use it to
support judgments about the human brain's cognitive or intellectual
processing capability. A DL-based model for improving the performance of
IoT-based systems that detect human emotional states is proposed in this
chapter. The concept allows data collected by IoT devices to be utilized to
detect and analyze employee emotional states. Cécile et al. (2016) found that
the machine learns the rules of the industry and links success with employee
sentiment.


4.3.1 Featured Image Formation 


Instead of using the raw facial images to learn the facial features, we used
two featured images, as they represent some prominent features at the initial
level. The formation of the two featured images is explained below.

The distance between each landmark point in the reference image and the
corresponding landmark point in each expressed image is calculated. The
difference between the corresponding landmark points of the two images is the
displacement vector, which has one entry per feature point. Figure 4.2 shows
the flowchart of the proposed feature point (displacement) extraction for
facial values.
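
The displacement step can be illustrated with a minimal Python sketch. It assumes that landmarks are given as (x, y) pixel coordinates listed in the same order for the reference image and the expressed image; the function name `displacement_features` and the sample coordinates are hypothetical and are not taken from the proposed system.

```python
import math

def displacement_features(reference_pts, expressed_pts):
    """Return per-landmark displacement vectors and Euclidean distances.

    Both inputs are sequences of (x, y) tuples in the same landmark order
    (an assumption of this sketch).
    """
    if len(reference_pts) != len(expressed_pts):
        raise ValueError("both images need the same number of landmarks")
    vectors = [
        (xe - xr, ye - yr)
        for (xr, yr), (xe, ye) in zip(reference_pts, expressed_pts)
    ]
    distances = [math.hypot(dx, dy) for dx, dy in vectors]
    return vectors, distances

# Hypothetical landmarks: two eye corners stay fixed, a mouth corner moves.
reference = [(30.0, 40.0), (70.0, 40.0), (50.0, 80.0)]
expressed = [(30.0, 40.0), (70.0, 40.0), (53.0, 76.0)]
vectors, distances = displacement_features(reference, expressed)
print(vectors[2], distances[2])
```

Keeping both the signed vectors and their lengths lets a later stage decide whether direction of movement matters or only its magnitude.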

Figure 4.2 Flow diagrams for displacement feature extraction.

The largest edge direction magnitude, E, is computed from the provided image
I, which has a size of m × n:

$$E_{i,j} = \max_{g=1,2,\dots,8} \left| \sum_{l=-1}^{1} \sum_{k=-1}^{1} K^{g}_{l,k}\, I_{i+l,\,j+k} \right| \qquad (4.1)$$

where i and j are integers ranging from 2 to m − 1 and 2 to n − 1,
respectively. E is derived from a set of eight Kirsch kernels, K^g, one in
each of the eight compass directions, and is determined by the mask that
produces the largest edge magnitude. Let kirsch_l, l = 0, 1, ..., 7, denote
the responses obtained after applying each of the eight Kirsch kernels K^g,
g = 1, 2, ..., 8 (so that l = g − 1), and let kirsch_max be the highest edge
magnitude obtained. Then

$$\mathrm{LDP} = \sum_{l=0}^{7} d(\mathrm{kirsch}_{l} - \mathrm{kirsch}_{\max}) \cdot 2^{l} \qquad (4.2)$$

where

$$d(a) = \begin{cases} 1, & a \ge 0 \\ 0, & \text{otherwise} \end{cases} \qquad (4.3)$$


The image is robust since it is formed by taking into account all eight 
directions. LBP is an image operator that converts an image into an array or 
image of integer labels, as previously mentioned. These labels or their statis- 
tics (most commonly, the histogram) are used for further picture processing. 
In this case, a modified LBP approach is applied. 
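
The Kirsch-based edge magnitude and the directional pattern code described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the image is a list of rows of grey levels, pixel indices are 0-based, the eight kernels are generated by rotating the standard Kirsch ring of weights (5, 5, 5, -3, -3, -3, -3, -3), and the absolute value of each response is used as its magnitude. The thresholding follows the text (comparison against the largest response); many published LDP variants compare against the k-th largest response instead. All function names are hypothetical.

```python
# Border positions of a 3 x 3 mask, clockwise from the top-left corner.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def kirsch_kernels():
    """Build the eight Kirsch compass kernels by rotating the border ring."""
    base = [5, 5, 5, -3, -3, -3, -3, -3]
    kernels = []
    for g in range(8):
        kernel = [[0] * 3 for _ in range(3)]
        for pos, (r, c) in enumerate(RING):
            kernel[r][c] = base[(pos - g) % 8]
        kernels.append(kernel)
    return kernels

KIRSCH = kirsch_kernels()

def kirsch_responses(img, i, j):
    """Responses of the eight kernels at interior pixel (i, j)."""
    return [
        sum(k[l + 1][m + 1] * img[i + l][j + m]
            for l in (-1, 0, 1) for m in (-1, 0, 1))
        for k in KIRSCH
    ]

def edge_magnitude(img, i, j):
    """Largest absolute kernel response at (i, j)."""
    return max(abs(r) for r in kirsch_responses(img, i, j))

def ldp_code(img, i, j):
    """Set bit l when response l reaches the maximum magnitude."""
    magnitudes = [abs(r) for r in kirsch_responses(img, i, j)]
    top = max(magnitudes)
    return sum((1 if m - top >= 0 else 0) << l
               for l, m in enumerate(magnitudes))

# A bright top row over a dark background gives one dominant direction.
patch = [[9, 9, 9],
         [1, 1, 1],
         [1, 1, 1]]
print(edge_magnitude(patch, 1, 1), ldp_code(patch, 1, 1))
```

On a perfectly flat patch every response is zero, so all eight bits are set; this is a known weakness of thresholding against the maximum and one reason the k-th largest response is often preferred.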

The input image is separated into many fixed-size blocks. In a 3 x 3 
window, each pixel’s eight neighbors (top, bottom, left, right, top left, bottom 
left, top right, and bottom right) are determined. Instead of comparing the 
value of the central pixel with the intensity values of the eight neighbors, the 
average intensity values of the neighbors are computed and compared with 
the intensity values of the eight neighbors. If the condition is met, the value 
is 1, or else it is 0. This yields an eight-bit binary number, which is used as a
feature. The neighbors can be processed in either clockwise or anti-clockwise 
directions. 
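
The modified LBP step can be sketched as below. The text does not state the exact comparison, so this sketch assumes that a neighbor contributes a 1 when its intensity is greater than or equal to the mean of the eight neighbors; the clockwise traversal starting at the top-left neighbor and the function names are likewise assumptions made for illustration.

```python
# Clockwise neighbor offsets (row, column), starting at the top-left pixel.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]

def modified_lbp_code(img, i, j):
    """Eight-bit code for interior pixel (i, j).

    Each neighbor is compared with the mean of the eight neighbors
    rather than with the central pixel.
    """
    values = [img[i + dr][j + dc] for dr, dc in NEIGHBORS]
    mean = sum(values) / 8
    bits = "".join("1" if v >= mean else "0" for v in values)
    return bits, int(bits, 2)

def modified_lbp_image(img):
    """Apply the operator to every interior pixel of the image."""
    rows, cols = len(img), len(img[0])
    return [[modified_lbp_code(img, i, j)[1] for j in range(1, cols - 1)]
            for i in range(1, rows - 1)]

# Bright upper-right half, dark lower-left half.
patch = [[9, 9, 9],
         [1, 5, 9],
         [1, 1, 1]]
print(modified_lbp_code(patch, 1, 1))
```

A histogram of the resulting codes over each fixed-size block would then serve as the block's feature vector.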

The generated binary numbers (also known as LBP codes) represent the image's
local primitive properties. They capture the many forms of curved edges,
spots, flat areas, and so on. Figure 4.3 shows a facial image converted into
binary feature details.

Figure 4.3 Facial images to feature extracted binary value (example code:
11110101).

The proposed study uses a neural network to classify Indian facial traits,
using a multimodal feature fusion dataset. These characteristics were used to
classify the signals through CNN-based training and recognition. According to
the psychological literature, there is a need to address the fusion or
integration of multimodal emotional data for optimal man-machine
communication, because the human sensory system can analyze and extract
accurate information about an individual's emotional state from a collection
of facial traits.


4.3.2 CNN Classification 


Figure 4.4 Proposed deep learning framework for employee's emotional
prediction architecture (input: industry employee facial image).

The suggested deep learning framework for employee emotional prediction is
depicted in Figure 4.4. Face regions such as the left eye, right eye, and
mouth are extracted, and each is used to train a neural network. Histogram
equalization, rotation correction, and spatial normalization are applied to
the face images to obtain representable features. The final expression
recognition result is determined via majority voting over the three CNNs'
results, using a decision-level fusion procedure.
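
Decision-level fusion by majority voting can be illustrated with a short Python sketch. It assumes each of the three region networks (left eye, right eye, mouth) outputs a single emotion label; the tie-breaking rule (prefer the earliest label in the input order) is an assumption of this sketch, since the text does not say how ties are resolved.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted most often.

    Ties are broken in favor of the label that appears first in
    `predictions` (an assumption made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label

# Hypothetical outputs of the left-eye, right-eye, and mouth networks.
print(majority_vote(["happy", "surprise", "happy"]))
```

With three voters and seven classes a three-way tie is possible, so a real system would likely fall back on the classifiers' confidence scores in that case.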

Image creation, pre-processing, feature extraction, and feature
classification for emotion recognition are among the five processes in this
architecture. Furthermore, rigorous theoretical and practical analysis is to
be carried out on six physiological signals of employees in industrial
environments, whose faces and facial expressions exhibit problematic
day-to-day variation, to help detect and interpret emotions (Gouizi et al.,
2011). The features of different emotions recorded on the same day tend to
cluster more densely than the features of the same emotion recorded on
different days. A new emotionally intelligent method must therefore be
created to deal with these daily variations.

Furthermore, the proposed policies are not limited to monitoring the conduct
of industry employees; they are easily applicable to stressed individuals
working in a multinational firm, who can be mentored and counseled by the
company's manager. Emotional intelligence policies can be applied in a
variety of settings, including homes, offices, conference rooms, hospitals,
control centers, and retail. There is plenty of evidence that our visual
processing architecture is divided into layers, and each stage alters the
data in a way that makes the visual task easier to complete. The ability to
share features or sub-features is another interesting property of deep
learning models. It has also been demonstrated that insufficiently deep
structures can be exponentially inefficient in terms of computation. Deep
learning was revolutionized when a very effective method for training
multilayer neural networks was developed (Zhong et al., 2017). Figure 4.5
shows that the proposed system consists of three units: an input unit, a
processing unit, and an output unit. The input unit comprises the employee
details, the psychological input, the battery input, ground, and the camera
used to capture the facial input of employees. These details are transferred
to the microcontroller over the controller area network (CAN), the converted
details are processed in the processing unit, and the resulting values are
stored in EEPROM memory. The final values, such as features, are stored in
the output section.

Figure 4.6 CAN bus bar functionality configuration emotional indications.

Figure 4.6 shows how the input values, configuration, and facial message
signal are transferred over the CAN communication protocol to the controller,
which calculates the facial signal information and stores it as a feature in
the output memory storage devices. Figure 4.7 shows the CAN communication for
the stored signals, the message signal attributes, and the number of bytes
consumed by each signal for employee emotion prediction. In total, six
signals are processed and sent over the CAN communication protocol; the
number of bytes consumed and the transfer bit rates are also defined.
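
To make the idea of six signals with defined byte widths concrete, the following Python sketch packs six values into a single 8-byte payload, which is the maximum data length of a classic CAN data frame. The signal names, their widths, and the little-endian layout are entirely hypothetical; the chapter does not specify the actual message design shown in Figure 4.7.

```python
import struct

# Hypothetical payload layout, little-endian, no padding (8 bytes total):
#   employee_id    uint16
#   emotion_code   uint8   (index of the predicted emotion class)
#   confidence_pct uint8   (0-100)
#   pressure_raw   uint16  (raw sensor reading)
#   battery_dv     uint8   (battery level in tenths of a volt)
#   status         uint8   (status flags)
FRAME_FORMAT = "<HBBHBB"
FIELDS = ("employee_id", "emotion_code", "confidence_pct",
          "pressure_raw", "battery_dv", "status")

def pack_emotion_frame(**signals):
    """Pack the six signals into the data field of one CAN frame."""
    payload = struct.pack(FRAME_FORMAT, *(signals[name] for name in FIELDS))
    if len(payload) > 8:
        raise ValueError("classic CAN data field is limited to 8 bytes")
    return payload

def unpack_emotion_frame(payload):
    """Recover the six signals from a received payload."""
    return dict(zip(FIELDS, struct.unpack(FRAME_FORMAT, payload)))

frame = pack_emotion_frame(employee_id=1042, emotion_code=3,
                           confidence_pct=97, pressure_raw=512,
                           battery_dv=33, status=1)
print(frame.hex(), unpack_emotion_frame(frame))
```

Fixing the layout in one format string keeps the sender and the receiver consistent, which plays the same role as the signal database used by CAN analysis tools.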

Figure 4.8 shows the analyzer and DBCAN+ software tool used to manually check
the signals before hardware design. In this tool, each of the employee
signals mentioned in Figure 4.5 is transferred and displayed in color. Each
color denotes a different signal in employee signal monitoring.
Figure 4.7 CAN bus bar message and signal design configuration for emotional indications.


Figure 4.8 CAN signal transmission for emotional indications.

The signals transferred over CAN are stored and passed to the following
CNN-based approach. The data received from humans over the controller area
network (CAN), such as pressure and facial emotion, are stored and classified
into the seven emotion categories. AlexNet is the first successful CNN for a
large training dataset, ImageNet (Tom et al., 2014). Its layout is relatively
simple compared with other recent architectures. It has five convolutional
layers and three fully connected layers, along with max-pooling and dropout
layers. The traditional activation function Tanh is given by the following
equation:

$$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (4.4)$$

This Tanh function is a saturating nonlinearity, which is slow to train, and
the exponential function is computationally expensive. So, a non-saturating
nonlinearity, ReLU, as given by eqn (4.5), is used by AlexNet, which is
faster to train and computationally efficient (Krizhevsky et al., 2012).

$$f(x) = \mathrm{ReLU}(x) = \max(0, x) \qquad (4.5)$$
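
The difference between the two activations can be checked numerically with a small Python sketch. It uses the identity tanh(x) = 2 / (1 + e^(-2x)) - 1 for the saturating function and max(0, x) for ReLU; the derivative helpers are added here only to show saturation and are not part of the chapter.

```python
import math

def tanh_via_exp(x):
    """Tanh written with a single exponential."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes for large |x|."""
    return 1.0 - tanh_via_exp(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (-5.0, 0.5, 5.0):
    print(x, tanh_via_exp(x), tanh_grad(x), relu(x), relu_grad(x))
```

For an input of 5 the Tanh gradient is already below 0.001, whereas the ReLU gradient stays at 1, which is the saturation effect the text refers to.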


ReLUs are reported to train about six times faster, and they do not need
input normalization to prevent saturation. However, a local response
normalization called brightness normalization is applied to aid
generalization. This normalization results in a form of lateral inhibition,
which is found in real neurons. AlexNet uses ReLU instead of Tanh to add
nonlinearity, which boosts speed at the same accuracy level. To deal with
overfitting, it uses dropout: to reduce overfitting to the training data and
to learn robust features, the output of each hidden neuron is set to zero
with a probability of 0.5. Because dropout roughly doubles the number of
iterations necessary to converge, it is used only on the first two fully
connected layers of the AlexNet design. The neurons that are dropped out do
not participate in the forward pass and, hence, in backpropagation. Thus,
dropout reduces the co-adaptation of neurons, as each neuron cannot rely on
the presence of other neurons. The pooling layer summarizes the outputs of
neighboring neurons in the same kernel map. To reduce the size of the
network, overlapping pooling layers are used. AlexNet models were trained
using stochastic gradient descent that minimizes the cross-entropy loss
function with particular values for momentum and weight decay. The learning
rates of all layers are first set to 0.01 and then manually reduced when the
validation error rate stops improving at the current learning rate (Shanthi
and Sabeenian, 2019). Here, we trained a single CNN for the two different
featured images, instead of multiple CNN models. This single-network training
saves both time and memory while preserving accuracy. There is no need to
learn different weights for different networks, as a score-level fusion
strategy would require. Instead, we adopted a feature-level fusion strategy.
Compared to other deep learning structures, CNN is commonly used as it
demonstrates better performance in recognition tasks due to its ability to
extract and learn robust features. Moreover, CNN uses a small number of bias
and weight values compared to other deep learning structures.
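
The dropout operation described above can be sketched in a few lines of Python. During training each activation is zeroed with probability 0.5; at test time this sketch keeps every neuron and scales the outputs by the keep probability, which is the convention described by Krizhevsky et al. (2012) and is an assumption here, since the chapter does not state the test-time rule. The function name and arguments are hypothetical.

```python
import random

def dropout_forward(activations, p_drop=0.5, training=True, rng=None):
    """Apply dropout to a list of activations.

    Training: each unit is zeroed independently with probability p_drop.
    Inference: all units are kept and scaled by (1 - p_drop).
    Returns the outputs and the binary mask (None at inference).
    """
    if not training:
        return [a * (1.0 - p_drop) for a in activations], None
    rng = rng or random.Random()
    mask = [0 if rng.random() < p_drop else 1 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask

hidden = [0.8, 1.5, 0.2, 2.1]
train_out, mask = dropout_forward(hidden, rng=random.Random(0))
test_out, _ = dropout_forward(hidden, training=False)
print(mask, train_out, test_out)
```

The same mask must be reused in the backward pass so that dropped units receive no gradient, which is what prevents them from taking part in backpropagation.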

The algorithm combines the original image with the virtual image, which 
is the outcome of the original image’s nonlinear modification. Pixels with a 
moderate intensity are boosted, whereas pixels with high or low intensity are 
altered. For both the test and training sets of photos, the distance between 
the original image and the virtual image is computed using statistical metrics. 
Furthermore, in expressing face images, the value of distinct pixels fluctuates. 
As a result, different weights might be assigned to different pixels. The weight 
is then calculated using the two least distance values, yielding an adaptive 
weight that is different for each image. Then, to recognize the faces,
score-level fusion is used, which overcomes the illumination variation caused
by the image's lighting. According to the findings of the experiments, the
proposed method outperforms earlier work in recognition accuracy.


4.4 Result and Discussion 


This study presented the construction and implementation of convolutional
neural networks to categorize seven fundamental emotion types from face image
datasets of employees. The proposed system was created in the MATLAB R2019b
IDE, which includes deep learning packages and libraries such as AlexNet. We
used a CNN to build our model, which was assisted by an embedded camera
backend and a PCA feature extraction module. Using a neural network that
includes convolution layers, pooling, and fully connected layers, the face
image can be classified into one of the following emotions: neutral, angry,
disgust, fear, sad, surprised, or joyful.

Preparation of datasets: The facial expression images from the Facial
Expression Recognition (FER2013) dataset and the Kaggle image database were
organized and extracted (Choi and Byung, 2020). The test images are JPG files
of 512 × 512 pixels captured with a real-time HD camera and applied in the
experiments. Furthermore, our model was also constructed to recognize human
emotions from live video. Several multiplicative and additive noises can be
reduced by denoising, and the quality of the filtered image is enhanced to a
great extent. Figure 4.9 shows images from the dataset used by the proposed
system for various types such as happy, fear, guilty, disgust, normal, pain,
and surprise.

Figure 4.9 Input images used in the employee emotional predictions.
Table 4.1 Employee emotional prediction performance results


Methods                                        Efficiency (%)   Precision (%)   Error rate (%)
Support vector machine                         79.35            85.60           14.4
Decision tree                                  81.25            89.70           9.3
Artificial neural network algorithm            86.50            90.85           9.15
Proposed convolutional neural network system   96.75            95.65           4.35


The type of parameters taken from the facial image determines the per- 
formance of a neural network or CNN. The performance is also influenced by 
how the parameter data is processed before being presented to the networks. 
The system created a model with 25 features and 19 facial points based 
on frontal photos of the face and 10 points based on profile images of the 
face. The geometric face model based on 30 feature characteristic points has 
now been completed. The identification rate of most other methods, such 
as feature point tracking, Gabor wavelet analysis, and optical flow tracking, 
was comparable to or slightly better than the identification rate of the seven 
real-valued and eight binary parameters utilized in the study. 

Negative, positive, or no significant divergence from the neutral value was 
seen in real-valued parameters. For diverse expressions, the trend of variation 
of different parameters concerning neutral values aids in the effective training 
of neural networks to recognize specific expressions. Each expression is 
defined by the real-valued and binary parameters (Table 4.1). However, for 
some expressions, some parameters do not show a significant variation from 
neutral value, and, hence, they do not contribute to detecting that expression. 
Employee images in Figure 4.9 were utilized, and it was determined that the
mean percentage for all the real-valued parameters, as well as the average
percentage occurrence of all binary parameters, deviates from the respective
neutral values for all seven expressions.

Table 4.1 shows that the model based on the proposed CNN yields the maximum
efficiency and precision and the minimum error rate on the Facial Expression
Recognition (FER2013) dataset (96.75, 95.65, and 4.35, respectively), whereas
the nearest competitor, the neural network algorithm (Khatun and Turzo,
2020), provides only 86.50 efficiency, 90.85 precision, and a 9.15 error
rate.

Comparatively, the proposed approach provides better results, with reported
margins of 5.25, 3.80, and 4.49 in the respective terms. The SVM offers the
lowest efficiency, 79.35%, with a precision of 85.60 and an error rate of
14.4 (Healy et al., 2018). The decision tree offers an efficiency of 81.25%,
a precision of 89.70, and an error rate of 9.3 (Jamil and Hamzah, 2018). The
proposed CNN, in contrast, yields better quality metric values for the Facial
Expression Recognition (FER2013) dataset, as shown in Figure 4.10.

Figure 4.10 Employee emotional prediction performance results.


Table 4.2 Employee emotional prediction time performance results.


Methods                                        Time in seconds
Support vector machine                         38.25
Decision tree                                  48.52
Artificial neural network algorithm            56.55
Proposed convolutional neural network system   36.65

Figure 4.11 Employee emotional prediction time duration performance results.

Table 4.2 reports the time taken by the model based on the proposed CNN
algorithm on the Facial Expression Recognition (FER2013) dataset: 36.65
seconds. The decision tree algorithm achieves an efficiency of 81.25% in
48.52 seconds, the support vector machine algorithm achieves 79.35%
efficiency in 38.25 seconds, and the artificial neural network algorithm
achieves 86.50% efficiency in 56.55 seconds. Comparatively, the proposed
approach is faster by 1.6 seconds than the SVM, 11.87 seconds than the
decision tree, and 19.9 seconds than the artificial neural network algorithm.
The proposed CNN also yields better quality metric values for the FER2013
dataset, as shown in Figure 4.11. The efficiency values obtained with the
suggested CNN are superior to those obtained with SVM, DT, and NN. The
proposed CNN system removes unrelated and superfluous features from the data,
and the features chosen aid in enhancing the performance of learning models
when the data is reduced for classification.
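
The time margins quoted above follow directly from Table 4.2, as the short Python sketch below shows. The dictionary simply restates the table; the function name `time_saved` is hypothetical.

```python
# Execution times in seconds, copied from Table 4.2.
TIMES = {
    "Support vector machine": 38.25,
    "Decision tree": 48.52,
    "Artificial neural network": 56.55,
    "Proposed CNN": 36.65,
}

def time_saved(times, proposed="Proposed CNN"):
    """Seconds saved by the proposed method relative to each baseline."""
    base = times[proposed]
    return {method: round(t - base, 2)
            for method, t in times.items() if method != proposed}

for method, saved in time_saved(TIMES).items():
    print(f"{method}: {saved} s slower than the proposed CNN")
```

Rounding to two decimals avoids floating-point noise in the subtraction, since the table values are themselves given to two decimals.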

The experimental results of the proposed system were analyzed with different
databases, and it was observed that the proposed methods show significant
improvement in generic face recognition quality and intelligibility.


4.5 Conclusion 


Face recognition is a promising area of research in image processing. A
facial recognition system can verify or identify an individual from a
digital image, both for security purposes and in applications such as
biometric authentication, human-computer interaction, video surveillance,
credit card verification, automatic indexing of images, and criminal
identification. The proposed method provides a model-based safety framework
for industrial automation systems, allowing management to better understand
employees’ current feelings about their work. The current method, however,
is limited to discrete-time models. This chapter addresses the issue by
using a CNN approach to model the continuous dynamics of monitoring in
conjunction with the discrete character of the control logic. It discusses
a reconfigurable smart WSN unit based on the Internet of Things for
monitoring technical safety factors in the workplace. The system can
intelligently collect sensor data and was designed with wireless
communication in mind. It is well suited to the real-time and practical
needs of a high-speed emotional data gathering system in an IoT setting.
Facial emotion recognition systems recognize emotions automatically.
Healthcare, tailored learning, robotics, event detection, and surveillance
are just a few of the applications. However, identifying the major
discriminative facial features remains a challenge because each emotion has
its own variability and nuance that must be represented. The primary goal
of this project is to address these issues. When features involving motion
sequences are evaluated, optical flow is an appropriate alternative, and the
features passed to PCA are histograms of motion orientation weighted by the
amplitude of motion.

Future work will focus on this issue and will offer a system that
considerably improves emotion identification for a larger number of images
and for longer sequences of face images. This study could also be expanded
by using a larger number of face photographs taken from various
perspectives to demonstrate improved performance in identifying emotions in
faces. In addition, the future of industrial management will be centered on
meeting climate change expectations, safeguarding human and machine safety,
and measuring and improving performance. Though no one predicts that safety
engineers or industrial hygienists will go out of business in the next
decade, our respondents are almost unanimous in their belief that
businesses will continue to need professionals who can handle a wide range
of industrial functions and even allied responsibilities.
Abstract


According to the World Health Organization, there are millions of visually
impaired persons throughout the world who struggle to move freely. They
often need help from sighted people. For visually impaired persons (VIP),
finding the way to a desired destination in a new area is a major
difficulty. This research aims to help these individuals move to any place
on their own. To this end, we propose a method for VIP that uses a
convolutional neural network (CNN) to recognize the situation and scene
objects automatically. The proposed system consists of an Arduino UNO,
ultrasonic sensors, a camera, breadboards, jumper wires, a buzzer, and an
earphone. The breadboards and jumper wires connect the sensors to the
Arduino UNO. The sensors detect obstacles and potholes, while the camera
acts as a virtual eye for visually impaired people by recognizing these
obstacles in any direction (i.e., front, left, and right). An important
feature of this system is that the blind user is informed of the scene
object, the system automatically calculates the distance to the obstacle,
and a voice message alerts and directs the user via the earphone. The
experimental results show that the CNN achieved an accuracy of 99.56% and a
validation loss of 0.0201%.


Keywords: Convolutional neural network, Arduino UNO, ultrasonic sensors, 
wayfinding system, situation awareness, activity instruction. 


5.1 Introduction 


According to the World Health Organization (WHO), more than two billion
people have a vision impairment, and in at least 1 billion of these cases
the impairment could have been prevented or has yet to be addressed [1].
The world faces numerous challenges in the field of eye care, including
treatment, a shortage of trained eye care providers, and a lack of
integration of eye care into the healthcare system. The study was released
on World Sight Day to alert the public to the growing burden of vision
impairment and blindness caused by eye conditions such as presbyopia, which
affects 1.8 billion people and occurs with advancing age; myopia, which
affects 2.6 billion people; cataract, which affects 65.2 million people;
and corneal opacity, which affects 6.9 million people [2].

A visually impaired person faces numerous challenges and often needs the
assistance of a sighted person to find the way, particularly in unfamiliar
environments. VIPs typically use a walking cane. Raspberry Pi, Arduino UNO,
and ultrasonic sensors can be used to assist visually impaired people. In
this system, Arduino UNO technology was employed to design a smart device
for the blind. The design not only detects obstacles but also guides users
within a specific range so that they can orient themselves easily. The
ultrasonic sensor emits sound waves that reflect from obstacles on any side
(i.e., left, right, and front), helping the visually impaired person detect
obstacles within the defined range. The distance between the visually
impaired person and an object is calculated from the ultrasonic sensor’s
starting and ending pulses. A buzzer is used to alert the blind user to
potential hazards. If an obstacle is too close to the blind user, the
proposed system generates a voice message and activates the buzzer. With
voice messages, the proposed design can provide full support for obstacle
avoidance. The proposed module will be beneficial to people who are blind:
it makes it easier for them to find their way in daily activities without
the standard mobility aids, and it is less expensive, smaller, and easier
to carry.

The remainder of the chapter is organized as follows: Section 5.2 contains
a literature review, Section 5.3 details the proposed methodology
[convolutional neural network (CNN)], Section 5.4 presents and discusses
the experimental results, and Section 5.5 concludes.
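
The distance calculation from the sensor's starting and ending pulses can be illustrated as follows. This is a minimal sketch, not the authors' firmware: it assumes the echo pulse timestamps are available in microseconds and that sound travels at roughly 343 m/s in air. The function names are hypothetical, and the 200 cm alert threshold is borrowed from Algorithm 1 later in this chapter.

```python
# Approximate speed of sound in air (about 343 m/s), in centimetres per
# microsecond. This value is an assumption; it varies with temperature.
SPEED_OF_SOUND_CM_PER_US = 0.0343


def echo_to_distance_cm(pulse_start_us, pulse_end_us):
    """Convert echo pulse start/end timestamps (microseconds) to distance in cm.

    The pulse covers the round trip to the obstacle and back, so the
    one-way distance is half of (duration * speed of sound).
    """
    if pulse_end_us < pulse_start_us:
        raise ValueError("pulse_end_us must not be earlier than pulse_start_us")
    round_trip_us = pulse_end_us - pulse_start_us
    return round_trip_us * SPEED_OF_SOUND_CM_PER_US / 2


def should_alert(distance_cm, threshold_cm=200.0):
    """Return True when the obstacle is within the alert threshold."""
    return distance_cm <= threshold_cm


if __name__ == "__main__":
    d = echo_to_distance_cm(0, 5000)  # a 5 ms round trip
    print(f"distance: {d:.1f} cm, alert: {should_alert(d)}")
```

Halving the round-trip time is the key step: the pulse travels to the obstacle and back, so using the full duration would double the reported distance.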


5.2 Literature Review 


Many attempts are being made worldwide to propose designs that help
visually impaired people detect obstacles using various electronic
technologies. According to the literature, some of these works are based on
microcontrollers and some use the global positioning system (GPS) and the
global system for mobile communications (GSM), but the majority use
ultrasonic sensors to detect obstacles. All of these efforts focus on
helping blind people detect obstacles rather than helping them navigate on
their own. Batavia et al. [3] suggested a distance-dependent, camera-based
method in which background motion is measured using homographic
transformations. Obstacles are detected, identified, and classified as
critical or normal according to whether they lie within a particular
distance. Hesch et al. [4] proposed a design consisting of a
two-dimensional laser scanner, a foot-mounted pedometer, and a three-axis
gyroscope to aid VIP in indoor environments. They used a two-layer
estimator: the first layer tracks the position of the blind cane, and the
second layer estimates the person’s location. Pradeep et al. [5] suggested
an environment perception device that uses an RGB-D camera. The proposed
system addresses three tasks: self-localization, obstacle detection, and
object recognition. In self-localization, depth is perceived using a
tracking technique based on color information. Obstacle detection and
recognition provide meaningful information to visually impaired people by
recognizing obstacles such as stairs, walls, vehicles, and doors.
Rodriguez et al. [6] proposed a
method for obstacle avoidance devices to help visually impaired persons.
They built an incremental map of the environment using visual SLAM
techniques to provide the spatial direction and location of the visually
impaired person at the same time. The proposed design also provided audio
feedback to alert a blind person to potential obstacles. Tapu et al. [1]
designed a smartphone-based obstacle detection and classification system
that helps VIP walk freely and safely in indoor and outdoor environments.
The Lucas-Kanade algorithm was used to extract features from images.
Shahdib et al. [7] designed a model that uses a head-mounted stereo camera
to search for the ground plane and decompose six-degrees-of-freedom motion
into ground-plane motion and planar motion to assist VIP. By examining the
range of variation, they estimated the ground plane using visual data or
IMU readings. Maidenbaum et al. [8] designed a system based on ultrasonic
sensors and a camera for obstacle detection and recognition, in which the
camera is used to recognize obstacles and measure their size. For VIP,
Leung et al. [9] suggested a helmet-based audio guidance system. They used
visual odometry and feature-based SLAM to build a three-dimensional map for
obstacle detection. Xiao [10] set out a system that provides context-aware
navigation services for vision-impaired people. To incorporate
sophisticated intelligence into navigation, the user must be aware of the
semantic qualities of the things in their surroundings. This interaction is
critical for improving communication about things and places so that better
travel decisions can be made. Suba Raja et al. [11] created a system for
visually impaired people that uses mobile computing to detect and recognize
faces. This mobile system is backed by a server-based support system.
Vlaminck et al. [12] used three ultrasonic sensors and a microcontroller to
detect the object range. Audio and vibration systems were also used to warn
visually impaired people to avoid obstacles. Eunjeong et al. [13] proposed
a novel mobile navigation device that can identify the situation and scene
objects while the user is walking. The proposed framework classifies a
user’s current situation in terms of their position by analyzing streaming
images. Then, using computer vision techniques, only the appropriate
background objects are identified and interpreted based on the current
situation. Ramadhan et al. [14] presented a wearable intelligent system
that helps VIP walk along roads by themselves, navigate public places, and
seek assistance. The system tracks the path and alerts the user to any
obstacles using a set of sensors. The user is warned by a sound alert and
by vibrations on the wrist, which can be useful in a noisy setting or when
the user has a hearing loss.
Table 5.1 Evaluation of previously reviewed systems with some limitations.

System name | Type of the sensors | Disadvantage | Techniques
Smart cane | Water and ultrasonic sensors | If the water is less than 0.5 inches deep, the water sensor will not detect it, and the buzzer will not stop until the water is completely dry. | Ultrasonic
Eye substitution | Vibrator motors with two ultrasonic sensors | For haptic feedback, the team employed three motors; to offer more detailed feedback, they could use a 2D array of these actuators. Use is restricted to Android smartphones. | Ultrasonic with GPRS, GPS, and GSM
Ultrasonic navigation cane | Ultrasonic sensor with Arduino | Cannot detect items that appear suddenly. | Ultrasonic
Assistive ultrasonic headset | Four ultrasonic sensors | Only a few guidelines are given, and the headgear muffles outside sounds. | Ultrasonic
Outdoor assistive navigation system | Three-axis accelerometer and magnetometer sensors (AAMS) | Because of its restricted range, the GPS receiver must be connected through Bluetooth to work. | GPS technology
Obstacle avoidance using auto-adaptive thresholding | Kinect’s depth camera | When the distance between source and target increases, the accuracy of the Kinect depth image diminishes. Beyond 2500 mm, the auto-adaptive threshold was unable to distinguish between the floor and the item. | Auto-adaptive thresholding
A mobility device for visually impaired people using dynamic vision sensors (DVS) | DVS | Further intensive testing is needed to demonstrate the performance of object avoidance and navigation strategies, as the test focused mostly on item identification in the scene’s center region. | Event-based
Obstacle avoidance using laser rangefinder (LR) | Novint Falcon, photo sensors, and supplementary sensors | It was difficult to pinpoint the exact position of barriers and angles. | Laser rangefinder
A path force | Kinect sensor | This design’s detecting range is too short, and the user must be trained to distinguish between vibration patterns for each cell. | Infrared and GPS
Artificial vision | Optical sensors | The technology has not been tested or connected with navigation systems to assure its performance, and it is uncertain whether it will improve navigation systems as claimed by the creators. | GPS and GIS vision-based positioning
Guidance system | Kinect sensor | The system’s detection of spatial markers must be improved, as well as the stability of the rebuilt walking plane. | Stereo vision, Canny filter, vanishing point, and fuzzy rules
Navigation-based system | Monocular camera | Their predetermined image sizes based on category may make it difficult to recognize the same item at different sizes. | RANSAC algorithm, HOG descriptor, BoVW vocabulary development, and SVM classifier
Vision assistance using RGB-camera | RGB-camera sensor | The infrared’s sensitivity to sunlight can have a detrimental impact on the system’s functionality when used outside or during the day. | RANSAC detection algorithm with infrared technology and density images
Ultrasonic sensors for obstacles detection | Four ultrasonic sensors | Above the waist level, the system is unable to identify impediments. There is no information about how to get about. The detecting range is limited. | Ultrasonic sensors and vision-based obstacles detection with SVM classifiers




Nivedita et al. [15] designed an electronic aid device that includes a
Raspberry Pi device, an ultrasonic sensor, a webcam microphone, and a
light-dependent resistor (LDR) sensor. For obstacle detection, an
ultrasonic sensor is used together with a camera. The LDR sensor detects
the brightness of the environment to determine whether it is dark or
bright. The proposed system detects objects in the environment for visually
impaired people and provides voice feedback through an earphone. Souza et
al. [16] presented a Raspberry Pi based system for blind people with
features such as tracking objects, sending feedback via voice, and giving
environmental information. The most important aspect of this work is
tracking the blind person’s location and notifying the caretaker to ensure
safety. Wafa Elmannai et al. [17] presented a comparison of wearable and
portable assistive devices for people with visual impairments to show the
progress of assistive technology for this group. Charis et al. [18]
developed a new smart assistive systems framework for VIP that employs a
user-centered design method to evaluate a variety of operational and
optional system characteristics. The outcomes of a series of interviews and
surveys with visually impaired and non-visually impaired persons were
analyzed to derive the system’s criteria for both on-site and remote use.
Yosra et al. [18] aimed to help these communities overcome their
difficulties in moving independently to any location. Their research
specifies three paths that let visually impaired students at the University
of Gezira easily reach different places on campus.

Data is acquired from the surrounding environment (through laser scanners,
cameras, or sonar) and sent to the user via tactile or auditory feedback;
most electronic devices that provide services for VIP employ either or both
methods. Which approach gives better feedback is still being debated.
Furthermore, although numerous systems have been presented over the last
decade, none of them is regarded as a comprehensive solution capable of
assisting visually impaired individuals in all parts of their lives. This
chapter has therefore highlighted some of the work that has been completed.


5.3 Proposed Methodology 


This research aims to detect and recognize obstacles for VIP using a deep
convolutional network. Figure 5.1 shows the operating procedure for
detecting obstacles along the path. It also depicts the process for
choosing the desired path and the steps involved in reaching the correct
area. If the blind user encounters an obstacle, the device keeps track of
how far the user has strayed and warns the user to change path.


5.3.1 Assistive Algorithm 


The assistive algorithm guides a visually impaired user toward the
destination. Obstacles are recognized in every direction (i.e., front,
left, and right), and each obstacle is formatted as an instruction
statement. It is therefore necessary to define the action first and then
estimate the corresponding parameters, e.g., step count, direction, and
current position. The ultrasonic sensor information is used to calculate
those parameters for this module.
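
One pass of the decision logic in Algorithm 1 can be sketched in Python as shown below. This is an illustrative reading of the pseudocode rather than the authors' implementation: the `State` class and the function name `assistive_step` are hypothetical, the direction is assumed to be an angle in degrees, and the thresholds (200 cm and 250) are taken from Algorithm 1 as printed, whose units for the direction change are not stated. Counters start at zero here instead of null.

```python
import math
from dataclasses import dataclass

STOP_DISTANCE_CM = 200  # obstacle threshold from Algorithm 1
TURN_THRESHOLD = 250    # direction-change threshold from Algorithm 1


@dataclass
class State:
    action: str = ""
    step_count: int = 0
    previous_direction: float = 0.0
    x: float = 0.0
    y: float = 0.0
    buzzer_on: bool = False


def assistive_step(state, distance_cm, current_direction_deg):
    """Apply steps 3 to 9 of Algorithm 1 once and return the updated state."""
    if distance_cm <= STOP_DISTANCE_CM:
        state.action = "Stop"
        state.buzzer_on = True
    elif current_direction_deg - state.previous_direction > TURN_THRESHOLD:
        state.action = "Turn"
    else:
        state.action = "Continue"
        state.buzzer_on = False

    if state.action == "Continue":
        # Dead-reckoning update: position from step count and heading.
        state.step_count += 1
        theta = math.radians(current_direction_deg)
        state.x = state.step_count * math.cos(theta)
        state.y = state.step_count * math.sin(theta)
    elif state.action == "Turn":
        state.previous_direction = current_direction_deg
    return state


if __name__ == "__main__":
    s = assistive_step(State(), distance_cm=350, current_direction_deg=0)
    print(s.action, s.step_count, s.x, s.y)
```

A caller would repeat `assistive_step` with fresh sensor readings until the current position matches the destination, which corresponds to steps 10 and 11.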


5.3.2 Data Acquisition 


The dataset was created from videos recorded with a camera attached to the
Arduino UNO, and each frame was extracted at a resolution of 640 × 480
pixels. Following common deep learning practice, the data was split into
two parts: 70% of the dataset was used for training and 30% for testing.
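
A 70/30 split of extracted frames might look like the following. This is a minimal sketch under stated assumptions: the frames are represented by a list of file names, the split is a seeded random shuffle, and the function name `split_frames` is hypothetical. The source does not say how the frames were actually assigned to each part.

```python
import random


def split_frames(frame_paths, train_fraction=0.7, seed=0):
    """Shuffle frame paths reproducibly and split them into train/test lists."""
    frames = list(frame_paths)
    random.Random(seed).shuffle(frames)
    cut = int(round(len(frames) * train_fraction))
    return frames[:cut], frames[cut:]


if __name__ == "__main__":
    paths = [f"frame_{i:04d}.jpg" for i in range(10)]
    train, test = split_frames(paths)
    print(len(train), "training frames,", len(test), "test frames")
```

Consecutive frames from the same video tend to be very similar, so a per-video split may give a more realistic estimate of test accuracy than a per-frame shuffle.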
Figure 5.1 Framework of the proposed system. (Flowchart steps recoverable from the figure: Activate Sensor, Generate Voice.)


5.3.3 Device Architecture 


Figure 5.1 shows the design block diagram of the device architecture, which
comprises an Arduino UNO, ultrasonic sensors, a camera, a breadboard,
jumper wires, a buzzer, and an earphone. The breadboard is a critical
component in the construction of the circuit: it acts as a link between the
sensors and the Arduino UNO. The jumper wires connect the sensors to the
Arduino UNO through the breadboard. The Arduino UNO has
Algorithm 1: Assistive Algorithm


Input: Ultrasonic sensor U, Arduino UNO, Camera, Buzzer
Output: Set of instructions I (Turn, Count, Direction, Current position)
1. Start
2. Initialize A ← null, Direction ← null, step Count ← null, current position (x, y) ← null
3. If Ultrasonic sensor U <= 200 cm, then A ← Stop
4.    Activate Buzzer
5. Else if Current Direction − Previous Direction > 250, then A ← Turn
6. Else A ← Continue, Stop Buzzer
7. If A = Continue, then update Current position (x, y):
8.    Count ← Count + 1, x ← Count · cos(Current Direction), y ← Count · sin(Current Direction)
9. Else if A = Turn, then Previous Direction ← Current Direction
10. If position = Destination, then Terminate
11. Else go to step 3


a power bank mounted on top, which supplies the modest amount of power the
board needs compared with a full-fledged desktop PC. The Arduino UNO has 4
ports, 40 GPIO pins, a memory card, a camera interface (CSI), and an HDMI
port. To power the Arduino UNO, the power bank is mounted on a wooden cane.
The four ultrasonic sensors are linked to the Arduino UNO, which requires a
5-V power supply. Three of the four ultrasonic sensors detect obstacles
from three directions (i.e., front, left, and right), while the fourth
detects potholes. The visual sensor has a resolution of 5 megapixels and is
connected directly to the Arduino UNO via the camera port to detect
obstacles from the front. The buzzer, which requires a 3-V supply, is
linked to the Arduino UNO to alert the blind person to obstacles. The
earphone is used for audio feedback and transmits a voice message to warn
the VIP of obstacles. The system also calculates the direction of and
distance to the obstacles.


5.3.4 Arduino and Its Interfacing 


The Arduino UNO has 14 digital input/output pins (six of which may be used
as PWM outputs), six analog inputs, a 16-MHz crystal oscillator, a USB
connection, a power jack, an ICSP header, and a reset button. It contains
everything needed to support the microcontroller; to get started, simply
connect it to a computer via USB or power it with an AC-to-DC adapter or a
battery. The UNO differs from earlier boards in that it does not use the
FTDI USB-to-serial driver chip; instead, it uses an Atmega8U2 programmed as
a USB-to-serial converter. “Uno” means “one” in Italian, and the name was
chosen to mark the upcoming release of Arduino 1.0. The UNO and version 1.0
are the reference versions of Arduino going forward. The UNO is the latest
in a series of USB Arduino boards and serves as the reference model for the
platform against which earlier versions are compared.

Figure 5.2 Arduino UNO.


5.3.5 Power 


The Arduino UNO may be powered through USB or an external power supply, and
the power source is selected automatically. External (non-USB) power can be
supplied via an AC-to-DC adapter (wall-wart) or a battery. The adapter is
connected by plugging a 2.1-mm center-positive plug into the board’s power
jack. Battery leads can be inserted in the GND and VIN pin headers of the
power connector. The board can operate on an external supply of 6 to 20 V.
If less than 7 V is supplied, however, the 5-V pin may deliver less than
5 V, and the board may become unstable. If more than 12 V is used, the
voltage regulator may overheat and damage the board. The recommended range
is therefore 7 to 12 V.
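
The voltage limits above can be summarized as a small check. This is only an illustrative sketch of the thresholds stated in the text (6 to 20 V operating range, instability below 7 V, regulator overheating above 12 V); the function name and the returned labels are hypothetical.

```python
def classify_supply_voltage(volts):
    """Classify an external supply voltage using the limits quoted in the text."""
    if volts < 6 or volts > 20:
        return "out of range"
    if volts < 7:
        return "may be unstable"
    if volts > 12:
        return "regulator may overheat"
    return "recommended"


if __name__ == "__main__":
    for v in (5.0, 6.5, 9.0, 15.0, 21.0):
        print(f"{v:>5.1f} V: {classify_supply_voltage(v)}")
```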

VIN is the input voltage to the Arduino board when it uses an external
power source (as opposed to 5 V from the USB connection or another
regulated power source). Voltage can be supplied through this pin or, if
power is supplied via the power jack, accessed through this pin. The
microcontroller and other components on the board are powered by a
regulated 5-V supply. This can come from VIN via the on-board regulator, or
from USB or another regulated 5-V supply. The on-board regulator also
generates a 3.3-V supply, with a maximum current draw of 50 mA.

Figure 5.3 Arduino pin description.


5.3.6 Memory 


The Atmega328 is equipped with 32 KB of flash memory for code storage (of
which 0.5 KB is used by the boot loader), 2 KB of SRAM, and 1 KB of EEPROM
(which can be read and written with the EEPROM library).


5.3.7 Deep Convolutional Neural Network (CNN) 


Deep learning has proven to be a particularly effective technology in
recent decades due to its ability to handle large amounts of data. Neural
networks with many hidden layers have eclipsed traditional approaches in
pattern recognition. Convolutional neural networks (CNNs) are a common type
of deep neural network. Researchers have been working on systems that can
recognize visual input since the 1950s, when artificial intelligence was in
its infancy. In the years that followed, this field came to be known as
computer vision. In 2012, a group of researchers from the University of
Toronto developed an AI model that outperformed the best image recognition
algorithms by a significant margin, ushering in a new era in computer
vision.

Figure 5.4 CNN architecture.

The Alex-Net AI system (named after its creator, Alex Krizhevsky) won the
2012 ImageNet computer vision challenge with a top-5 accuracy of about 85%.
The runner-up scored about 74% on the test. Alex-Net used convolutional
neural networks, a kind of neural network loosely inspired by human vision.
CNNs have since become an essential component of many computer vision
applications and are covered in most online computer vision courses. The
following sections look more closely at how CNNs work by going through
several pre-trained image classification (IC) models used in computer
vision applications such as image captioning, neural style transfer (NST),
anomaly detection, and image categorization.


5.3.8 Alex-Net Architecture 


The deep CNN-based Alex-Net architecture is used in this study for
real-time detection and recognition of obstacles such as vehicles, doors,
pillars, and stairs for visually impaired people. Alex-Net, a pre-trained
model, is used to extract deep features for classifying complex images that
cannot be classified using simple handcrafted features [6].


5.3.9 Xception Model 


Francois Chollet proposed the Xception model, an extension of the Inception
architecture in which depth-wise separable convolutions replace the
standard inception modules. It is a 71-layer-deep convolutional neural
network, available in a version pre-trained on more than a million images
from the ImageNet database. The network can classify images into 1000
object categories, like pillar, car, stair, and a variety of other
obstacles. As a consequence, the network has learned rich feature
representations for a wide range of images, which may be used to categorize
obstacles for persons who are visually impaired.

Figure 5.5 Alex-Net architecture.


5.3.10 Visual Geometry Group (VGG16,19) 


VGG16 is a simple and widely used convolutional neural network (CNN)
architecture trained on ImageNet, a massive visual database project used in
visual object recognition research. Karen Simonyan and Andrew Zisserman of
the University of Oxford introduced the VGG16 architecture in their 2014
paper “Very Deep Convolutional Networks for Large-Scale Image Recognition.”
The term “VGG” stands for Visual Geometry Group, the group at the
University of Oxford that designed this architecture, and the number “16”
indicates that the architecture has 16 weight layers. Many deep learning
image classification approaches employ VGG16, which is popular owing to its
ease of use. Because of these advantages, VGG16 is often used in learning
applications. VGG16 was one of the top performers in the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) in 2014, placing first in
localization and second in classification, and it remains a widely used
vision architecture. VGG19 has 19 weight layers, with one extra convolution
layer in each of the last three convolutional blocks. Because of their
depth, both VGG16 and VGG19 performed well in the image competition [10].


5.3.11 Residual Neural Network (ResNet) 


A residual neural network (ResNet) is a form of artificial neural network
(ANN) inspired by the pyramidal cell structures of the cerebral cortex.
Residual neural networks use skip connections, also known as shortcuts, to
bypass certain layers. Most ResNet models use double- or triple-layer skips
that contain nonlinearities (ReLU) and batch normalization in between. An
additional weight matrix can be used to learn the skip weights; such models
are known as HighwayNets. Skip connections are used for two reasons: first,
to mitigate the vanishing gradient problem, and second, to avoid the
degradation problem, in which adding extra layers to a sufficiently deep
model leads to higher training error. The ResNet architecture comes in a
variety of forms, each based on the same principle but with a different
number of layers. Examples include ResNet-18, ResNet-34, ResNet-50,
ResNet-101, ResNet-110, ResNet-152, ResNet-164, and ResNet-1202 [9, 35].
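
The skip connection can be illustrated with a double-layer residual block in plain Python. This is a deliberately simplified sketch: it uses fully connected layers on a single vector, omits batch normalization, and assumes the input and output have the same size so they can be added. The helper names (`relu`, `linear`, `residual_block`) are illustrative, not part of any library.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(v, w):
    """Multiply row vector v (length n) by matrix w (n rows, m columns)."""
    return [
        sum(v[i] * w[i][j] for i in range(len(v)))
        for j in range(len(w[0]))
    ]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between.

    The '+ x' term is the skip connection: if F contributes nothing,
    the block simply passes relu(x) through.
    """
    f = linear(relu(linear(x, w1)), w2)
    return relu([a + b for a, b in zip(f, x)])


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    print(residual_block([1.0, 2.0], zeros, zeros))
```

With all-zero weights the block returns its (rectified) input unchanged, which is why extra residual layers need not increase training error: the identity mapping is easy to represent.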


5.3.12 Inception (V2, V3, InceptionResNet) 


GoogLeNet [34], the winner of the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) 2014, is a 22-layer network built from inception
modules. It also emphasizes the significance of depth. GoogLeNet uses 12
times fewer parameters than Alex-Net while achieving near human-level
performance. In Inception V2, the mean and standard deviation of all
feature maps at the output of a layer are computed and used to normalize
the responses (batch normalization). Later, Inception V3 was created by
carefully designing the network and using 3 × 3 and 1 × 1 filters rather
than larger ones.
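
The normalization step attributed to Inception V2 above can be sketched for a single list of responses. This is a simplified illustration only: real batch normalization works per channel over a mini-batch and adds a learned scale and shift, both of which are omitted here. The function name and the `eps` constant are assumptions made for the example.

```python
import math


def normalize_responses(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of the responses.

    eps keeps the division stable when the variance is close to zero.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    print(normalize_responses([1.0, 2.0, 3.0, 4.0]))
```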

Google introduced InceptionResNet, which was inspired by ResNet’s
performance. A residual connection is added to the output of the
convolution operation of the inception module. After convolution, the depth
is increased, and this model achieves a low top-5 error on ImageNet
classification. With 467 layers, InceptionResNet combines the notions of an
inception network with a deep residual network to speed up training and
improve accuracy.


5.3.13 MobileNet 


MobileNet employs depth-wise separable convolutions. A depth-wise separable
convolution consists of two operations: a depth-wise convolution followed
by a point-wise convolution. A traditional convolution handles the input
and output channels and the spatial dimensions of the feature maps in a
single operation. In a depth-wise convolution, each input channel is
convolved separately with its own single filter. As a result, the number of
output channels equals the number of input channels. The point-wise (1 × 1)
convolution then combines these channels. MobileNet also offers two more
parameters that reduce the number of operations even further: the width
multiplier (between 0 and 1), which lowers the number of channels, and the
resolution multiplier, which lowers the input image resolution.
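
The saving from depth-wise separable convolutions can be made concrete by counting weights. The sketch below compares a standard convolution with its depth-wise separable counterpart and applies a width multiplier; biases are ignored, and the function names and the example layer sizes (3 × 3 kernel, 32 input channels, 64 output channels) are assumptions chosen for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depth-wise k x k convolution plus a 1 x 1 point-wise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def apply_width_multiplier(channels, alpha):
    """Scale a channel count by the width multiplier alpha (0 < alpha <= 1)."""
    return max(1, int(channels * alpha))


if __name__ == "__main__":
    k, c_in, c_out = 3, 32, 64
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.2f}")
```

For this example layer the separable form needs 2336 weights instead of 18432, roughly an eightfold reduction, and halving the channel counts with a width multiplier of 0.5 reduces it further to 656.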

The main difference between MobileNetV2 and the original MobileNet is that
it employs inverted residual blocks with linear bottlenecks. It has
significantly fewer parameters than the original MobileNet. MobileNets
support any input image size larger than 32 × 32, with larger image sizes
providing better performance.


5.3.14 DenseNet 


DenseNet was created primarily to address the effect of vanishing gradients
on the accuracy of deep neural networks. Put simply, information fades before
it reaches its destination because of the long path between the input and
output layers.
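
A small numerical sketch shows why a long path weakens the signal. It assumes, purely for illustration, a chain of sigmoid activations evaluated at zero and ignores the layer weights; DenseNet itself is not tied to this activation, and the function names are hypothetical.

```python
import math


def sigmoid_derivative(x):
    """Derivative of the logistic sigmoid; its maximum value is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def gradient_scale(depth, x=0.0):
    """Factor by which a gradient shrinks after `depth` sigmoid layers."""
    return sigmoid_derivative(x) ** depth


for depth in (2, 10, 50):
    print(depth, gradient_scale(depth))
```

Even in the best case for the sigmoid, the factor falls below one millionth after ten layers, which is the kind of loss that shorter connections between layers are meant to avoid.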


5.3.15 Experimental Results Analysis 


The design has been examined experimentally. The results in this section
explain how the desired path can be selected and how the obstacle detection
system works. They also show how the control system behaves in the
experiments. In addition, the experiment covers the recognition of four
different obstacles (i.e., vehicles, doors, pillars, and stairs) for visually
impaired people. Four performance metrics are used to evaluate the proposed
method: accuracy, F1-score, precision, and recall. They are given in the
following equations:


$$\text{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP} \tag{5.1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5.2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5.3}$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5.4}$$


The accuracy, precision, recall, and F1-score of the suggested system were
computed using the various measurement parameters. Table 5.2 shows that the
proposed method achieved the best performance with an accuracy of 99.63%, a
precision of 99.59%, a recall of 99.31%, and an F1-score of 99.56%.
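
The four metrics can be computed directly from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below follows the standard definitions; the function name and the counts are invented for illustration and are not the values behind Table 5.2. Zero denominators are not handled.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1-score from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Made-up counts for a single obstacle class.
accuracy, precision, recall, f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```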

The proposed model is trained from a pre-trained Caffe Alex-Net model, which
has the best validation accuracy. The validation accuracy of the model during
training is 99.5139%, with a validation loss of 0.0201%, as shown in
Figure 5.6. Based on these results, the Alex-Net model is chosen, and a
computer-aided detection (CAD) system is developed using an Arduino UNO,
ultrasonic sensors, and a camera. The proposed research is compared with
existing research in Table 5.3.

The confusion matrices for the classification results of the deep learning
models were calculated for the activities of VIP. For each activity, a
comparison is made using performance metrics such as accuracy, precision,
recall, and F1-score.
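
For a multi-class problem such as the four obstacle categories, the per-class counts are usually read off the confusion matrix in a one-vs-rest fashion. The following sketch assumes rows are true classes and columns are predicted classes; the function name and the 3 × 3 matrix are made up for illustration and do not reproduce Figure 5.7.

```python
def per_class_counts(matrix):
    """One-vs-rest TP, FP, FN, TN for each class.

    matrix[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    counts = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                         # rest of row c
        fp = sum(matrix[r][c] for r in range(n)) - tp    # rest of column c
        tn = total - tp - fn - fp
        counts.append({"tp": tp, "fp": fp, "fn": fn, "tn": tn})
    return counts


# Hypothetical confusion matrix for three classes.
example = [[50, 2, 0],
           [1, 47, 2],
           [0, 3, 45]]
for label, c in enumerate(per_class_counts(example)):
    precision = c["tp"] / (c["tp"] + c["fp"])
    recall = c["tp"] / (c["tp"] + c["fn"])
    print(label, c, round(precision, 4), round(recall, 4))
```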

The experimental outcomes of the suggested VIP methods are illustrated in
Figure 5.6. The four types of obstacles (obstacles1, obstacles2, obstacles3,
and obstacles4) are recognized by the pre-trained Alex-Net architecture.


Table 5.2 Performance results obtained from pre-trained CNN models (Alex-Net) 


Categories Accuracy Precision Recall F1-score 
Obstacles1 99.61% 99.04% 100% 99.51% 
Obstacles2 99.70% 99.35% 100% 99.67% 
Obstacles3 99.23% 100% 98.23% 99.10% 
Obstacles4 100% 100% 100% 100% 
Total 99.63% 99.59% 99.31% 99.56% 


Table 5.3 Performance results comparison with previous work


Categories                       Accuracy   False alarm rate   Precision   Recall   F1-score
[13]                             97%        0.016%             NA          NA       NA
Assistive algorithm (proposed)   99.56%     0.0201%            99.59%      99.31%   99.5%

Figure 5.6 Loss and cross-validation accuracy.

Figure 5.7 Confusion matrix.


5.4 Conclusion and Future Directions


This paper proposes a high-level framework for interpreting semantic items in
the physical environment using real-time localization methods. With the
development of such an intelligent system, we introduce an operating concept
that allows users to travel on their own. The system can be used both indoors
and outdoors without the guidance of a sighted person. For this purpose,
obstacles are detected in three directions using three sensors (front, left,
and right), and a further sensor is used to detect potholes. The visual sensor
is used to recognize obstacles in the way of smart cane users. Overall, the
proposed design offers advantages over a traditional cane. Improvements could
still be made to render the system more efficient and effective than the
currently proposed design. In the future, we aim to install GPS, which would
help the visually impaired person in outdoor locations, help their relatives
to find them easily, and provide guidance. Also, although the system presented
thus far provides accurate detection, it is strongly advised that a newly
created neuro-fuzzy control algorithm be programmed into the microcontroller
to provide intelligent direction in terms of obstacle avoidance. Another
suggestion is to add a good outdoor navigation guidance system; the system
could be used with RFID. The last but most important recommendation is to
include a battery monitoring circuit because of the high power consumption of
the designed system. The precision of obstacle detection will be hampered by
an insufficient current supply.


Abstract


Internet of Things (IoT) based systems are used to define communication
systems based on machine-to-machine interaction. When integrated with a
convolutional neural network (CNN), IoT can provide a system that communicates
with its surroundings using human speech. Natural language processing (NLP)
can interact with IoT-based deep learning systems to advance the automation
field. IoT can connect a network of specific devices and exploit deep learning
for feature extraction, namely sensor features, radio frequency features, and
speech features. IoT with NLP can be used to develop speech-based recognition
systems for home automation. Smart home applications can be integrated with
voice-command-based IoT devices to communicate specific commands to the
devices. In addition, NLP-based IoT devices can help disabled people to
perform their daily activities. These devices can monitor their health and
provide voice-based security alerts. Also, NLP-enabled IoT devices can be
helpful for automating environmental data collection, which includes
geographical activities. However, NLP-based IoT implementation has certain
limitations, namely language understanding, change in accent, and change in
voice. These challenges restrict the efficient and quick utilization of
NLP-based IoT devices. Deep learning technology with a big vocabulary database
has provided numerous opportunities to train voice and command recognition
systems in IoT. IoT-enabled CNN devices for voice recognition act as a boon to
society.


Keywords: IoT, deep learning, NLP, home automation. 


6.1 Introduction 


Internet of Things (IoT) based devices are used to automate configuration
systems based on human speech interaction [1, 2]. The interaction can take
place through either body gestures or voice commands. Emerging trends in deep
learning technologies and cloud computing have improved the performance of IoT
devices integrated with either voice recognition systems or gesture control
systems. With these advanced IoT devices, users are now able to perform their
daily computing tasks more easily and efficiently. Generally, IoT devices
include sensors, control devices, and routing devices. IoT devices also
contain software focusing on networking, embedded systems, and middleware.
When configured together, all these devices contribute a great deal to
automating tasks with the support of deep learning technologies. The
integration of natural language processing (NLP) in IoT devices can improve
the ability and efficiency of the overall configuration system. Devices can
now be trained to execute real-time commands rather than to execute specific
tasks based on pre-fed commands.

In recent years, IoT has given rise to many applications, ranging from the
development of modern infrastructure and smart cities to sectors such as
agriculture, healthcare, home automation, transportation, and education [3].
These applications are powered mainly by smart learning mechanisms for
prediction, pattern recognition, data extraction, and data analytics. Many
machine learning approaches and algorithms can be employed for intelligent
learning, but in recent years deep learning has been the one mainly employed
for IoT applications. The reason behind this dominance of deep learning in IoT
devices may be the emerging need for analytics that cannot be fulfilled by
traditional machine learning approaches. Instead, a variety of modern methods
of data analysis, artificial intelligence, and NLP is needed to handle big
data from modern-day IoT systems. Figure 6.1 illustrates the various
application areas of NLP-enabled deep-learning-based IoT devices.

According to the McKinsey report on the global economic impact of IoT, with
advancement in technology IoT will make a significant annual economic impact
of about $2.7 trillion to $6.2 trillion by 2025 [4]. This study predicted that
the healthcare sector will be a major contributor toward it, followed by
industry and energy. On the other hand, transportation, security, urban
infrastructure, agriculture, and retail will be minor contributors that
together still account for a good share of the annual economic impact. The
report also places the impact of machine learning under the automation of
knowledge work. It states that advances in ML techniques, such as deep
learning and neural networks, are major contributors to automation. Other
areas, such as speech and gesture recognition, will also benefit considerably
from ML technologies. The report estimates a potential economic impact of
$5.2 trillion to $6.7 trillion from knowledge work automation for the year
2025 for IoT systems.

Figure 6.1 Details of various applications of deep-learning-based NLP-enabled
IoT devices. Applications shown: dictating clinical notes for medical records,
predictive analysis for mental health, voice-based security alert system,
voice-command-based home automation, health monitoring, medical reports
analysis, chatbots for medicinal reminders and health assistance,
physiological analysis, and automating appliances.

With the help of deep learning technologies and NLP, IoT systems can
communicate with their surroundings using human speech. NLP can interact with
IoT-based deep learning systems to advance the automation field. IoT can
connect a network of specific devices and exploit deep learning for feature
extraction, namely sensor features, radio frequency features, and speech
features. IoT with NLP can be used to develop speech-based recognition systems
for home automation. Smart home applications can be integrated with
voice-command-based IoT devices to communicate specific commands to the
devices [5]. In addition, NLP-based IoT devices can help disabled people to
perform their daily activities. These devices can monitor their health and
provide voice-based security alerts. Also, NLP-enabled IoT devices can be
helpful for automating environmental data collection, which includes
geographical activities. However, NLP-based IoT implementation has certain
limitations, namely language understanding, change in accent, and change in
voice. These challenges restrict the efficient and quick utilization of
NLP-based IoT devices. Deep learning technology with a big vocabulary database
has provided numerous opportunities to train voice and command recognition
systems in IoT. In sum, IoT-enabled CNN devices for voice recognition act as a
boon to society. When embedded in IoT devices, these latest technologies can
not only improve the efficiency of the existing system but also provide
benefits to the users. Real-time voice-command-based and gesture-control-based
IoT devices provide a world-class experience to people.


6.2 Related Work 


In the past few years, a lot of work has been proposed in IoT and deep
learning to explore the possible applications of combining both fields.
Table 6.1 tabulates representative work using deep learning techniques based
on gesture recognition. The authors have proposed novel architectures for
implementing deep-learning-based IoT systems. In this direction, Gladence et
al. [6] reported significant progress toward realizing intelligent
environments that are capable of automatically providing needed services for
user comfort by detecting the user's actions and gestures. They proposed a
system that automatically turns on the AC when a person enters and sits in a
room. The basic idea behind this system was to automatically turn appliances
on or off based on the user's activity. In this way, IoT automation can help
to save energy while keeping up with human needs. Mohammadi et al. [7]
performed research primarily on deep learning applications in IoT for big data
and streaming analytics. The authors discussed different deep learning
techniques with an overview of their application in the field of IoT [7, 8].
Various frameworks have been proposed that combine deep learning with other
machine learning approaches such as reinforcement learning, online learning,
and transfer learning. In addition, applications of deep learning in IoT were
discussed for different sectors such as smart cities, energy, and intelligent
transportation systems. Jiang et al. [9] studied Wi-Fi sensing applications in
health monitoring and gesture recognition. The authors provided a brief review
of Wi-Fi enabled gesture recognition, comparing various gesture recognition
systems along with their accuracies. The research indicated that gestures such
as lip motion, hand gestures, motion direction, and keystrokes can be
recognized with a Wi-Fi based gesture recognition system. The reviewed systems
achieved accuracies of around 90%.


Table 6.1 Representative work under various gesture recognition systems with
the performance evaluation


Authors | Gesture | Performance metrics | Summary
Wang et al. [10] | Lip motion | 91% accuracy | WiHear, a system to hear people talk, using Wi-Fi signals.
Abdelnasser et al. [11] | Hand gestures | 90% accuracy | WiGest, uses Wi-Fi RSSI to detect hand gestures.
Liu et al. [12] | Keystrokes | 77.43% accuracy with 30 training samples per key, and 93.47% accuracy with 80 training samples per key | This system is based on Wi-Fi signals.
Qian et al. [13] | Motion direction | 92% accuracy | WiDance, Wi-Fi CSI-based motion direction sensing.
Virmani et al. [14] | Activity recognition | 91.4% accuracy | WiAG, Wi-Fi based configuration recognition system and virtual samples generated for all possible gestures using translation function.

Hussain et al. [3] reviewed the applications of machine learning techniques to
IoT security challenges and threats. They studied reinforcement learning, deep
learning, and their applications to different security problems in IoT
networks, and presented machine-learning-based solutions for IoT security that
address these challenges and threats. Mandula et al. [15] analyzed the use of
IoT in home automation using an Android mobile app and a
micro-controller-based Arduino board. The authors proposed two models for home
automation, one of which used Bluetooth and the other Ethernet for network
communication. Because of the limited range of Bluetooth connectivity, the
first model was used indoors, whereas the Ethernet-based model could also be
controlled from outside the home. Xiao et al. [16] reviewed ML-based methods
for data privacy and security in the IoT. In their study, they also identified
several challenges that have to be addressed in ML-based security techniques
in practical IoT systems, such as partial state observation, computation and
communication overhead, backup, and security solutions.

In sum, a lot of work has been proposed using machine-learning-based
techniques for home automation and for addressing security-related challenges
and threats. In addition, gesture-based systems have been used to make the
experience simpler and more useful. The next section reviews deep learning
work inspired by voice automation systems using IoT.


6.3 IoT-Enabled CNN for NLP


IoT is a system in which various devices capable of operating over the
Internet are connected together to perform one or more operations such as
processing, signaling, sensing, and recognition [17]. As the Internet has
evolved over the past years, communication between devices and computers has
become simpler and faster, which has led to major development in the field of
IoT and provided a faster and more effective way of information retrieval and
automation. There has also been a lot of development in machine learning,
especially in deep learning, where CNNs have been applied to pattern
recognition, image processing, NLP, object tracking, etc. Research in the past
few years has helped greatly in building smart systems in which devices are
able to detect objects and people, patterns in images, and voice. Research in
both fields has thus opened new ways of automation, in which day-to-day tasks
can be automated with smart devices that use machine learning algorithms to
execute tasks together with IoT devices in the physical surroundings.
Applications of deep learning with IoT devices, and more specifically of
CNN-based IoT devices, are tabulated in Table 6.2.

It is evident that CNN is more versatile than the DNN model when applied to
IoT domains [27, 28]. CNN can recognize even small patterns in the data
provided, which helps it to outperform other deep learning models not just in
the IoT domain but also in many other applications.


6.4 Applications of IoT-Enabled CNN for NLP


This section presents a review of applications of IoT-enabled CNN for NLP. IoT
is a major contributor to home automation. Many technologies, such as
Bluetooth [31], GSM [32-34], and Zigbee, are used in home automation. GSM is
an ideal remote communication technology where the Internet is not available.
Wi-Fi, owing to its communication speed and its availability in almost every
part of the world, is preferred for NLP [35, 36] and other deep learning
applications. Advancements in the Internet and in technologies such as cloud
computing, artificial intelligence, and wireless networking have opened more
fields for IoT to contribute to. IoT together with deep learning has provided
many applications in healthcare domains such as medical diagnosis, disease
prediction, and personal healthcare. A major healthcare application lies in
the physiological domain, where the user's health data can be collected
through smartphone sensors or other wearable devices, with compact IoT devices
implementing motion recognition and detection techniques by means of deep
learning. A major motivation for home automation is to assist people with
severe disabilities [37]. NLP-enabled home automation systems can help people
who are physically challenged but can speak and listen, making them less
dependent on other people. Figure 6.2 illustrates the architecture of a basic
home automation system based on speech recognition. The rest of this section
focuses on home automation and how it can be implemented to help disabled
people with NLP-based IoT systems.


Table 6.2 Details of various deep learning models and the application
description


Application | Reported work (models compared: CNN, LSTM, RBM, RNN)
Image recognition | CNN for plant disease recognition, overall 96.3% accuracy [18]; model based on two-layered LSTM [19]; model achieved 66% top-1 accuracy and 80% top-3 accuracy [20].
Physiological detection | Model based on five convolution layers and three pooling layers [21]; model combining CNN and LSTM for activity recognition [22]; RNN improved the score of the proposed architecture [23].
Localization | Faster RCNN used for integrating feature extraction and classification into one network [24]; RBM-based indoor fingerprinting scheme using CSI [25].
Smart home | Gesture-controlled home automation system achieved an accuracy of 98.12% [26].
Healthcare | CNN with ten convolution layers and two fully connected layers used to detect cardiovascular disease from mammograms [27]; RBM-based posture analysis in fall detection and classification [28].
Sports | AlexNet-based hierarchical model for group activity recognition on volleyball dataset [29]; RNN model for classifying NBA offensive plays [20].
IoT infrastructure | CNN for identifying wireless network interference [30].


Figure 6.2 Basic architecture for speech-recognition-based IoT automation. The
basic flow of information with the details of hardware devices has been shown.
Blocks shown: voice capturing through Android device, NLP processing,
communication network, and microcontroller for IoT device.
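
The flow in Figure 6.2 can be sketched as a function that turns a recognized transcript into a short command message for the microcontroller. This is only a toy stand-in for the NLP processing block: it uses keyword matching instead of a trained model, and the device names, action words, and the `device:ACTION` message format are all hypothetical.

```python
import re

# Hypothetical vocabulary; a real system would learn this from data.
DEVICES = {"light", "fan", "door"}
ACTIONS = {"on": "ON", "off": "OFF", "open": "OPEN", "close": "CLOSE"}


def parse_command(transcript):
    """Map a spoken-text transcript to a 'device:ACTION' message, or None."""
    words = re.findall(r"[a-z]+", transcript.lower())
    device = next((w for w in words if w in DEVICES), None)
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    if device is None or action is None:
        return None  # nothing actionable was recognized
    return f"{device}:{action}"


print(parse_command("Please turn on the light."))
print(parse_command("Close the door"))
print(parse_command("What time is it?"))
```

The returned string is what would be handed to the communication network; utterances with no known device or action are simply dropped.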

Table 6.3 Details of various deep learning models with the description in IoT
applications


Deep learning model | Type of model | IoT applications | Summary
RNN | Supervised | Identify movement pattern; behavior detection [38] | Useful for time-dependent data
CNN | Supervised | Traffic sign detection [39] | Visual tasks such as pattern recognition
LSTM | Supervised | Human activity recognition [40] | Applicable on problems with time-series data, where there is long time lag in data
RBM | Unsupervised, Supervised | Indoor localization [41] | Feature extraction and dimensionality reduction
Auto-encoders | Unsupervised | Fault detection [42] | Training data with noise; noise is removed by encoding data into a complex function for refinement


6.4.1 Home Automation 


IoT has a major application in smart home automation. In home automation,
IoT-based systems are mainly realized in day-to-day appliances such as
fridges, TVs, air conditioners, and heating systems for ease of control. IoT
provides remote access to home appliances, for example for remotely
maintaining the home temperature and monitoring changes in the surroundings.
Indoor localization is a popular topic in home automation [39, 40]. It is an
alternative to GPS: satellite technologies such as GPS lack precision in
places like airports, parking garages, buildings, and underground locations.
Indoor localization uses a network of devices to locate people and objects in
small places where technologies like GPS lose precision or fail to track
[41, 42]. It enables many services such as intrusion detection and monitoring
of a person in a localized place. Traditional indoor positioning systems use
K-nearest neighbors, Bayesian models, or SVMs.
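
As a rough illustration of the K-nearest-neighbors approach to indoor positioning, the sketch below compares an observed signal-strength vector with stored fingerprints and averages the positions of the k closest ones. The fingerprint values, coordinates, and function name are invented for this example and do not come from any of the cited systems.

```python
import math


def locate(fingerprints, observed, k=1):
    """Estimate (x, y) from the k nearest fingerprints in signal space.

    fingerprints: list of (signal_vector, (x, y)) pairs recorded in advance.
    observed: signal-strength vector measured at the unknown position.
    """
    ranked = sorted(fingerprints, key=lambda fp: math.dist(fp[0], observed))
    nearest = ranked[:k]
    x = sum(pos[0] for _, pos in nearest) / len(nearest)
    y = sum(pos[1] for _, pos in nearest) / len(nearest)
    return (x, y)


# Hypothetical fingerprints: received signal strength (dBm) from three access points.
fingerprints = [
    ((-40, -70, -80), (0.0, 0.0)),
    ((-70, -45, -75), (5.0, 0.0)),
    ((-80, -72, -42), (0.0, 5.0)),
    ((-60, -60, -60), (2.5, 2.5)),
]
observed = (-42, -68, -79)
print(locate(fingerprints, observed, k=1))
print(locate(fingerprints, observed, k=2))
```

With k = 1 the estimate snaps to the closest recorded location, while k = 2 averages it with the next closest one and moves the estimate away from it.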
Paramvir Bahl and Venkata Padmanabhan [43] proposed an RF-based user location
and tracking system. They collected signal strength information for 70
distinct physical locations in all four directions on the floor. For the basic
analysis, only the single nearest neighbor in signal space was selected. KNN
was then applied in signal space to find the location of the user. The system
offers better accuracy for smaller values of k than for large values. This may
be because averaging over several neighbors in signal space pulls the estimate
away from the true location. As data keeps evolving and growing, traditional
machine learning approaches to indoor localization run into problems, so
researchers have turned toward deep learning. In this direction, Gu et al.
[44] proposed a model for Wi-Fi based indoor localization that used deep
learning to improve feature classification. To ensure fast learning, a
semi-supervised extreme learning machine (ELM) was also introduced.

Gesture-based recognition systems have been proposed for home automation with
IoT. NLP-based smart home automation systems mainly use an ordinary
sound-receiving system such as a microphone. The sounds or voice instructions
can be collected through the microphone of a smartphone or through a
microphone mounted on a micro-controller of an IoT device in the network.
Digital personal assistants such as Apple's Siri, Google Assistant, Amazon
Alexa, and Microsoft's Cortana are all based on deep learning and have shown
great success [45]. They operate on voice commands given by the user.
Assistants like these can be implemented within IoT networks to take voice
commands and perform a series of tasks. Modern NLP is not limited to
recognizing instructions given through text or speech; along with speech
recognition, it is also capable of recognizing the person whose voice is
giving the instructions, which provides a way of keeping IoT-enabled devices
secure and reducing misuse of the devices by unauthorized persons. This can be
seen as an advantage of NLP-based IoT systems: they can take a command and
check whether the user is authorized to use the device in the network, and
only then is the command accepted as input by the IoT device to initiate the
required action.

Hence, integrating NLP with IoT could be a further major development in home
automation, where NLP can provide a way to use short messages or voice notes
to interact with home appliances such as ACs and refrigerators. CNN-based IoT
architectures can be widely applied to home security, where CNN models
integrated with IoT systems can be used for tracking activity inside and
outside the house. For example, an automated face detection model can be
integrated with an IoT system to automatically open the doors for people
authorized to enter the premises. As mobile technology advances quickly,
indoor localization has become a popular research topic. The technology is not
limited to home automation systems; a lot of work has also been proposed to
make life easier for disabled people. The next section details existing work
that helps to improve the quality of life of disabled people.


6.4.2 Boon for Disabled People 


Home automation is not limited to automating appliances. Physically challenged
people are much more dependent on others for their daily activities, and
NLP-enabled IoT solutions can help to ease their lives. In past years,
monitoring the health of a patient or a disabled person was a difficult task
for doctors and family members. Advancements in IoT have enabled the
development of micro-devices for health monitoring, which has improved the
quality of services for patients and disabled people. Many researchers have
proposed and developed Android-based models that use IoT devices in the
physical environment to track health information of disabled people such as
heart rate, oxygen level, and body temperature [46-48]. These devices use
Bluetooth to share information over short distances and have been integrated
with a speech-based SMS system for sharing messages and connecting with the
caretaker, as shown in Figure 6.3. Techniques like dictionary search are also
being integrated with the speech recognition system to personalize the health
assistant for better results in remote communication between a disabled person
and the caretaker [49].

Figure 6.3 IoT-based model for health monitoring. NLP-enabled health
monitoring devices have been integrated with connecting services like Internet
or Bluetooth. Details of health record are saved on cloud for remote access by
health organizations.

IoT devices can be integrated with deep learning to support physically
challenged people, making them independent in daily and frequent activities
such as opening doors or windows and switching appliances on or off by voice
command. Gonzalez et al. [50] proposed a device for enhancing the quality of
life of visually disabled people. The device could perform many tasks, such as
email reading, medication reminders, music playback, and meditation reminders,
but its major function was recognizing people and detecting any obstacle in
the way. The device uses ultrasonic sensors for obstacle detection, which send
and receive wave signals that are processed by the Raspberry Pi module to
detect objects. For recognizing people, the device is fitted with a video
camera that sends captured images to the computer vision module for face
detection and recognition, which are done with the help of Haar wavelets and
Fisherfaces, respectively. Another important study in this area was proposed
by Bhargava et al. [51]. Its main aim was to develop an image-to-text-to-speech
system for the benefit of the visually impaired. The authors proposed an IoT
system integrated with image processing and NLP modules. The system operates
on a Raspberry Pi with a camera module for image acquisition. Its function was
to detect text in the images using image processing techniques and convert it
to machine-readable text using OCR. The detected text is then saved for future
reference or processed by the NLP module, which dictates the information
present in the image to visually impaired people.
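
The ultrasonic obstacle detection mentioned above rests on a simple time-of-flight calculation: the pulse travels to the obstacle and back, so the distance is half the echo time multiplied by the speed of sound. The sketch below assumes a speed of about 343 m/s in air and a hypothetical 1 m warning threshold; it does not reproduce the cited device's code.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value in air at about 20 °C


def echo_to_distance_m(echo_seconds):
    """Distance to the obstacle from the round-trip echo time."""
    return SPEED_OF_SOUND_M_PER_S * echo_seconds / 2.0


def is_obstacle(echo_seconds, threshold_m=1.0):
    """True when the measured distance is within the warning threshold."""
    return echo_to_distance_m(echo_seconds) <= threshold_m


# Made-up echo times in seconds.
for echo in (0.005, 0.010):
    print(echo, round(echo_to_distance_m(echo), 4), is_obstacle(echo))
```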

Automated doors can be installed in homes and other places to assist disabled
people. For example, deep learning with IoT devices can help to automate the
opening and closing of doors based on voice commands or on detected motion
around them, giving disabled people easy access to the house without any help.
Security systems and devices can be built with deep-learning-based IoT devices
for object detection and voice detection; these can be installed for security
surveillance around or inside the house and provide intrusion alerts to the
respective authority for the safety of elderly and physically challenged
people. In this direction, Wadhwani et al. designed an Arduino-based system
that sends alerts to the owner in case of trespassing [52, 53]. IoT-enabled
deep learning models can help to develop devices that make the lives of such
people easier and more independent.


6.5 Applications of IoT-Enabled CNN


IoT has impacted many fields and provided real-time solutions to their
problems. These include farming, infrastructure, and many more.


6.5.1 Smart Farming 


Agriculture is an important aspect of sustaining human life, meeting the
demand for food by breeding plants and animals. Before the technological
advancement of the Internet, traditional farming relied on intensive farming
practices. However, with the rising demand for food due to the increasing
population and environmental factors such as climate change and extreme
weather conditions, there is a paradigm shift from intensive farming practices
toward smart farming. Over the past 10 years, many researchers have studied
IoT-based smart farming, which includes IoT-based irrigation systems and IoT
modules for monitoring soil moisture, soil temperature, and other
characteristics of the soil and environment that are important for a good
yield. With the recent developments in deep learning, researchers have started
to look toward using CNNs for better automation through IoT modules or
systems. Since then, many researchers have proposed such CNN-based IoT systems
and have succeeded in developing automated systems of this kind for smart and
efficient farming.

Shekhar et al. [54] proposed an IoT-based irrigation system. The system was
fully automated and monitored soil temperature and moisture through a sensor
network. The characteristics captured by the sensors were used to prepare a
dataset for predicting when the soil needs irrigation. These predictions
were made by a relatively simple machine learning algorithm, K-nearest
neighbors (KNN). The system includes an Arduino Uno and a Raspberry Pi 3 as
key components for the machine-to-machine communication that controls the
whole irrigation process. Another important part of this system was that
both the prepared dataset and the predicted dataset were made available to
the farmer through a cloud server.
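The KNN prediction step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of [54]: the readings, the labels "irrigate" and "skip", and the function name knn_predict are all hypothetical, and the features are left unscaled for brevity.

```python
import math
from collections import Counter

# Hypothetical sensor readings: (soil moisture %, soil temperature in deg C)
# paired with a label saying whether irrigation was needed. Values are invented.
TRAINING_DATA = [
    ((15.0, 34.0), "irrigate"),
    ((18.0, 31.0), "irrigate"),
    ((22.0, 33.0), "irrigate"),
    ((45.0, 24.0), "skip"),
    ((50.0, 22.0), "skip"),
    ((55.0, 26.0), "skip"),
]


def knn_predict(sample, training_data, k=3):
    """Return the majority label among the k nearest training points."""
    by_distance = sorted(
        training_data,
        key=lambda item: math.dist(sample, item[0]),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict((20.0, 32.0), TRAINING_DATA))  # dry, hot soil
    print(knn_predict((52.0, 23.0), TRAINING_DATA))  # moist, cool soil
```

In a deployed system the features would normally be scaled to comparable ranges before distances are computed, since moisture and temperature are measured in different units.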

Figure 6.5 illustrates a basic architecture for an ML-based, IoT-controlled
irrigation system. A similar one has been proposed by the authors of [55].
The data acquisition module consists of sensor networks that collect data on
environmental conditions such as moisture level and surrounding temperature.
This data is stored at regular intervals for training purposes or transmitted
directly to the ML module for analysis and prediction of irrigation
requirements.

Figure 6.5 ML automated IoT controlled irrigation system.


A further improvement to this system could be two-way communication between
the cloud storage and the database. This could improve the predictions by
taking into account other factors, such as soil composition, nutrition
levels, the moisture requirement of a certain crop, and related parameters
that change according to the type of crop being cultivated. Two-way
communication is needed because these factors can only be determined by
laboratory tests, so the end user has to update them regularly.

Table 6.4 Details of various proposed deep learning models based on available deep learning frameworks for the application in smart farming

| Smart farming application | Framework | Performance metrics | Description |
|---|---|---|---|
| Recognize plant species | AlexNet | 99.5% accuracy | The authors [56] prepared two datasets from the original one using data augmentation techniques and concluded that venation structure is an important feature for recognizing plant species. |
| Identify crop species and diseases | AlexNet + GoogLeNet | 0.9934 F1-score and 99.35% accuracy | The authors [57] experimented with these two deep learning architectures under two training mechanisms, transfer learning and training from scratch, with three types of datasets: colored, grayscale, and segmented. As expected, transfer learning gave better results, the best of which were found with GoogLeNet. |
| Leaf disease detection | CaffeNet | 96.3% accuracy | A new dataset was generated from available datasets using image augmentation techniques. An important aspect of this research [58] is that the dataset includes background images, which are classified with an accuracy of 98.21%, giving good separation of plant leaves from their surroundings. |
| Classify weeds from crop | VGGNet | 86.20% classification accuracy | The authors [59] used six different datasets, covering 22 different species, to prepare a new dataset through image augmentation techniques. A 50% dropout rate was applied before the two fully connected layers. A mini-batch of 200 images was selected for training the model, which was initialized with weights from the VGG16 network. |
| Fruit counting | Inception-ResNet | 70%-100%, with an average accuracy of 91.03% | The authors [60] came up with a new CNN model that integrates four layers of a modified Inception-ResNet-A with a modified Inception network. Training was done on 24,000 images with the Adam optimizer, and the initial weights were taken from the Xavier initializer. |
| Crop disease classification for early disease symptoms | Deep ResNet | Balanced accuracy of 0.84 in real testing conditions | The integral part of the system [61] was the image processing module, which was based on superpixel-based tile extraction using the SLIC superpixel extraction algorithm. The purpose of this step was to avoid degrading the visual features of early signs of disease by not downscaling the parts of the image where the disease could be present. |

Luigi et al. [62] proposed a CNN-based model for plant disease detection
and diagnosis by analyzing the leaves. The model was trained with about
87,800 plant images of 25 different plants categorized into 58 different
plant-disease combinations. The best performance, obtained with a VGG CNN,
was an accuracy of 99.53% in identifying plant-disease combinations on a
testing dataset of about 17,550 images. This research suggests that CNNs can
be used successfully in real time to analyze plant images and detect plant
diseases. Indhu Mathi et al. [63] proposed an automated diagnosis and
classification system. Two clusters, healthy leaf areas and infected areas,
were formed for each plant image. Features were then extracted from the
selected cluster to estimate the percentage of infected leaves. The
algorithm comprises three main parts: first K-means for clustering, then the
gray-level co-occurrence matrix (GLCM) for feature selection, and, finally,
a support vector machine (SVM) for classification to identify the plant
disease.
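To make the GLCM step more concrete, the sketch below counts co-occurring gray levels for one pixel offset and derives a single texture feature (contrast) from the result. It is a minimal illustration, not the code of [63]: the 4x4 patch, the choice of four gray levels, the right-neighbor offset, and the function names are assumptions, and the matrix is left non-symmetric.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """GLCM contrast: sum of p(i, j) * (i - j)^2 over the normalized matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(
        (count / total) * (i - j) ** 2
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )


# A hypothetical 4x4 leaf patch quantized to 4 gray levels (0..3).
PATCH = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

if __name__ == "__main__":
    counts = glcm(PATCH, levels=4)
    for row in counts:
        print(row)
    print(round(contrast(counts), 4))
```

Features such as this contrast value, computed for the infected cluster, would then form the feature vector passed to the classifier.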

Santosh et al. [64] proposed a system based on the standard YOLO model for
detecting tomato plant diseases. The authors used a dataset of leaf images
covering three tomato plant diseases, namely Gray Spot, Late Blight, and
Bacterial Canker, together with 275 image samples of healthy tomato plant
leaves. The system’s computing unit was implemented on a Raspberry Pi with a
graphical user interface (GUI) for collecting images and capturing videos.
The main purpose of the system’s image processing module is to acquire
images through the IoT system and to perform image augmentation and feature
extraction on the acquired images or video. The average loss of the trained
model is 0.0634 and the mean average precision (mAP) is 0.76. An overall
accuracy of 89% was obtained on the PlantVillage dataset.


6.5.2 Smart Infrastructure 


Nowadays, we are quite familiar with the term “smart cities,” which means
cities that are managed and run well enough to promote economic growth. As
the world changes and the population increases, cities are becoming more
crowded and their management is becoming a crucial and difficult task. One
solution to this problem is a smart information and communication network
that allows us to improve and automate management and operational tasks.
IoT can provide smart monitoring systems through a network of sensors and
receivers, and there has also been significant growth in research on machine
learning systems that help in decision making. Integrating CNNs with IoT
opens the way to smarter and more intelligent transportation systems [65],
smart lighting, smart parking management, etc. Some of the research on
intelligent traffic management and monitoring systems is summarized in
Table 6.5.

IoT systems based on machine learning or artificial intelligence can serve
many different functions. One example is monitoring vibrations in the
materials of bridges and buildings to check for unusual sounds, which can
help identify air pockets and unusual gaps in the structure and, in turn,
help analyze structural strength to prevent a possible collapse. Systems of
this kind rely largely on sensor networks, radio-frequency identification,
and video surveillance devices to collect the information that the machine
learning module needs for decision making, after which the IoT module
carries out the appropriate actions. Such machine-learning-based IoT systems
can contribute a great deal toward developing and managing modern
infrastructure in much smarter and more automated ways.
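Table 6.5 lists a multi-agent Q-learning system for traffic lights [66]. The core of any such agent is the tabular Q-learning update, sketched below for a single intersection. This is a generic illustration under stated assumptions rather than the system of [66]: the state encoding (queue lengths in two directions), the action names, the reward (negative total queue length), and the learning parameters are all hypothetical.

```python
from collections import defaultdict

ACTIONS = ("green_ns", "green_ew")  # hypothetical action names


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


def greedy_action(q_table, state):
    """Pick the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


if __name__ == "__main__":
    q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
    # State: (vehicles queued north-south, vehicles queued east-west).
    state = (8, 2)
    # Giving green to the long queue leaves 7 vehicles waiting in total.
    q_update(q, state, "green_ns", reward=-7, next_state=(3, 4))
    # Giving green to the short queue leaves 10 vehicles waiting in total.
    q_update(q, state, "green_ew", reward=-10, next_state=(9, 1))
    print(greedy_action(q, state))
```

In a distributed setting, one such table (or a function approximator replacing it) would be maintained per intersection, with queue lengths supplied by the surveillance cameras.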


Table 6.5 Intelligent traffic control and monitoring systems based on AI-integrated IoT systems

| Authors | System | Description |
|---|---|---|
| Ying et al. [66] | Intelligent traffic control system | The system was an artificial-intelligence-based IoT system for handling traffic lights. It was a distributed multi-agent Q-learning system that could be deployed to control local traffic lights for vehicles as well as pedestrians. Surveillance cameras were used to check the queue lengths, taken as the number of vehicles and pedestrians waiting at the intersection. Different agents were deployed at different intersections, each with a computation and control module. |
| Hasan Omar et al. [67] | Intelligent traffic information system | The system is able to acquire real-time traffic information and monitor it. It is based on IoT and wireless communication. The architecture uses wireless sensors and active radio-frequency identification (RFID) to collect real-time traffic data and to monitor the traffic flow, which is done by a multi-agent-based system. These agents are autonomous and intelligent. Hence, they can interact usefully with their environment without much human interference. The system allows automatic representation, tracking, and querying of tagged traffic objects. |
| Danping Wang et al. [68] | Urban traffic guidance system | This work employed IoT devices for urban traffic control. The main aim of developing this kind of system was to deal with the problem of traffic jams in cities. A large amount of data was collected through integrated devices on connected vehicles and roadside units. The traffic guidance information module acquires data related to road transport infrastructure and real-time transportation information. The optimal path search module implements a resistive multi-objective-based database constrained optimal path algorithm. |


6.6 Challenges in NLP-Based IoT Devices and Solutions


Security and privacy are two big challenges in deploying IoT devices in
commercial environments. IoT is becoming a major part of modern
infrastructure, dealing with billions of objects and humans. Hence, security
in IoT is a necessary and complex task, needed to assure information
security and consistent operation of the devices. IoT resembles a smart
wireless sensor network and inherits part of its architecture from wireless
sensor networks (WSNs). It therefore also inherits the security challenges
of WSNs, such as network breach, physical tampering, jamming, selective
forwarding, and wormhole attacks [69]. Because the main communication
channel for IoT is the Internet, many IoT devices are connected to it for
remote access, which opens backdoors for security attacks, such as simple
security breaches that stop the workflow of the device. Security attacks can
be classified into two categories, namely “passive attacks” and “active
attacks” [70]. Passive attacks focus on deducing system information by
monitoring messages and data transmission. The main motive is to gather
information from the system rather than to affect system resources and
operations. Such attacks can go undetected because they do not alter any
data or operation. In an active attack, the motive is to alter system
resources and affect the system’s operation. Figure 6.6 illustrates passive
and active attacks.

Implementing NLP also poses a major challenge of data theft: in a security
breach, training data such as facial recognition images or voice patterns
can be stolen or manipulated. A major limitation of IoT devices stems from
their operating environment, which adds further security challenges to the
operation of the devices. To ensure data security, techniques such as
cryptography can be used [70]. Modern cryptography is implemented in one of
two forms: “private-key cryptography” or “public-key cryptography.” The
private-key method uses a symmetric-key algorithm in which the sender and
receiver share a secret key for their communication. The public-key method
uses asymmetric-key algorithms, in which a public key is provided to all the
communicating parties while the corresponding private key is kept secret
[71, 72].
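The public-key idea can be illustrated with a toy RSA key pair built from two very small primes. This is a sketch for intuition only and must not be used for real security: the primes, the exponent, the function names, and the sample sensor value are hypothetical, and real systems rely on vetted libraries with large keys and padding.

```python
def make_toy_keypair(p=61, q=53, e=17):
    """Build a tiny RSA key pair from two small primes (insecure, demo only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e; needs Python 3.8 or later
    return (n, e), (n, d)  # (public key, private key)


def encrypt(message, public_key):
    """Anyone holding the public key can encrypt an integer smaller than n."""
    n, e = public_key
    return pow(message, e, n)


def decrypt(ciphertext, private_key):
    """Only the holder of the private key can recover the message."""
    n, d = private_key
    return pow(ciphertext, d, n)


if __name__ == "__main__":
    public_key, private_key = make_toy_keypair()
    reading = 65  # a hypothetical sensor value, smaller than n
    ciphertext = encrypt(reading, public_key)
    print(ciphertext != reading, decrypt(ciphertext, private_key) == reading)
```

In the private-key (symmetric) case, by contrast, a single shared secret would be used for both directions, so both parties must obtain it securely in advance.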

Confidentiality means preventing data from being accessible to any
unauthorized user. As modern IoT systems get smarter, a larger amount of
data is being generated. This analytics data needs to be handled carefully:
in the wrong hands, it could lead to violations of user privacy or to
tampering with data, which could stall the IoT devices in the network or
cause them to malfunction.

Figure 6.6 Details of various attacks on IoT-based devices. (a) Passive
attack. (b) Active attack.


Integrity means maintaining the completeness of the data transmitted over
the network in a system. IoT systems include many devices connected
together, communicating through transmission protocols over the Internet or
GSM, so the chances of distortion or of noise being added to the data on
the communication channel are higher.
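One common way to detect such changes in transit, given here as an assumption rather than a technique named in this chapter, is to attach a keyed message authentication code to each payload. The sketch below uses HMAC-SHA256 from the Python standard library; the shared key, the payload format, and the function names are hypothetical.

```python
import hashlib
import hmac

SHARED_KEY = b"demo-key-not-for-production"  # hypothetical pre-shared key


def sign(payload: bytes, key: bytes = SHARED_KEY) -> bytes:
    """Return an HMAC-SHA256 tag for the payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(payload: bytes, tag: bytes, key: bytes = SHARED_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(payload, key), tag)


if __name__ == "__main__":
    message = b"soil_moisture=41"
    tag = sign(message)
    print(verify(message, tag))              # unchanged payload
    print(verify(b"soil_moisture=14", tag))  # payload altered in transit
```

Because the tag depends on a secret key, a device that accepts only correctly tagged payloads also gains some assurance about who sent them.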

Authentication is the process of identifying a user or device so that
access can be granted according to need. Strong authentication is a must in
an IoT system to prevent unauthorized users from issuing control commands or
accessing system data, which could lead to system abuse.

Authorization means granting users access rights to an IoT system. Users
can include humans, sensors, or other IoT devices in the network. The main
challenge is how to provide access successfully to all kinds of user
entities in such an environment [56]. In addition, data and control should
be made available only to authorized users.

Availability is a fundamental property that must be ensured for successful
deployment, as authorized user entities should always have access to the IoT
system. Threats such as DoS, jamming, and node capture can make the services
of IoT devices unavailable.

Figure 6.7 Security challenges and hazards in IoT network.


The five security properties above, shown in Figure 6.7, should be
considered carefully to ensure the security of IoT systems. A weakness in
any of these properties can expose the system to several security threats,
which in turn could lead to several issues.


6.7 Conclusion 


In this chapter, we have reviewed and analyzed work in the domain of IoT.
IoT is an emerging domain with the capability to automate tasks, and it can
improve the life experience of disabled people too. However, IoT on its own
has limited capability to recognize patterns, voice, or anomalies by
learning from data. Applying CNN-based models in IoT has led to great
progress in automating tasks by addressing those issues. In addition,
automation tasks can be made simpler and more accessible by embedding voice
commands and speech recognition through NLP. These capabilities not only
give the customer a different experience but also make disabled people more
independent in their day-to-day tasks. We have also discussed various
challenges and security issues related to CNN-based IoT systems. To ensure
the privacy and confidentiality of user data, it is necessary to protect
IoT-based devices from cyberattacks. Various cyberattacks in IoT have been
reviewed and the details are tabulated with the solutions.


Abstract


The classification of ECG signals is critical in therapeutic treatment.
Traditional approaches have reached a plateau in performance, which leaves
room for improvement. The objective of this research is to use ECG signals
to accurately locate and identify myocardial infarction (MI). In the
proposed algorithm, the enhanced deep neural network (EDN), deep learning
methods such as the convolutional neural network (CNN) and long short-term
memory (LSTM) were applied. Vector operations such as matrix multiplication
and gradient descent were carried out in parallel on large matrices of data
with GPU support. Parallelism in EDN decreases the time it takes for a
procedure to run. For the PTB database, the proposed EDN model achieves a
higher accuracy of 87.80% according to the confusion matrices of the
algorithms. Based on criteria such as precision, recall, F1 measure, and
Cohen’s kappa coefficient, the proposed model demonstrates improved
performance. These enhancements to EDN’s performance would aid in saving
human lives.


7.1 Introduction


Myocardial infarction (MI), commonly known as a “heart attack,” is the
death of cardiac muscle caused by persistent ischemia. Pain radiating from
the chest to the shoulder, arm, and neck is the most common symptom of a
myocardial infarction. Atherosclerosis is the most common cause of
myocardial infarction, as evidenced by autopsy studies on the causes of MI.
Atherosclerosis is the hardening of the arteries, characterized by
atheromas, or lesions, that appear in the blood vessels. The core of an
atheromatous plaque consists predominantly of cholesterol and cholesterol
esters, and it is covered by a white fibrous cap made mostly of smooth
muscle cells.

Myocardial infarction is the death of heart muscle. It is part of the
spectrum of acute coronary syndromes (ACS) and a consequence of coronary
artery disease (CAD), which can reduce blood flow to the heart. Unlike other
acute coronary syndromes, such as unstable angina (UA), a myocardial
infarction involves cell death, as determined by a troponin or CK-MB blood
test of cardiac enzymes (Liu et al., 2018). Based on the ECG reading, a
suspected MI is classified as an ST-segment elevation myocardial infarction
(STEMI) or a non-ST-segment elevation myocardial infarction (NSTEMI).

The principal clinical event in patients with coronary atherosclerosis is
myocardial infarction (MI), often grouped under acute coronary syndrome
(ACS); the exact term used depends on the prevailing definition and on how
the condition presents. Thrombus formation is a rare occurrence during the
growth of a coronary artery blockage. As a result, coronary heart disease
(CHD), also known as coronary artery disease (CAD) or ischemic heart disease
(IHD), would be unlikely to be fatal in the absence of thrombus development.

Myocardial infarction is one of the major challenges facing the healthcare
system, with a global death rate of 265 per 100,000 population and around
224 in the Mediterranean region. According to estimates, deaths from
cardiovascular diseases will increase by 15% in affluent countries with
rapid economic growth, such as the United States (Wang et al., 2019), by 77%
in China, and by 16% in other Asian countries. Analytical studies found the
highest rate of myocardial infarction in Finland and the lowest rate in
Japan. The British Heart Foundation estimates a yearly rate of acute
myocardial infarction of 0.6% for men and 0.1% for women in people aged
30-69 years.

The World Health Organization and all 194 member countries have agreed on
global mechanisms to reduce CVD risk by 25% by 2025, building on the
“Global action plan for the prevention and control of NCDs 2013-2020” and
focusing on both the prevention and the control measures needed to detect
cardiovascular diseases.

Cardiovascular disease is a potentially fatal ailment that affects more
than 17 million individuals worldwide each year. According to Abdulrazaq et
al., an electrocardiogram (ECG) is a recording of the electrical activity of
the human heart. MI is described as an abnormality or interruption in the
myocardium’s regular activation process.

The PTB database is used to obtain the MI signals. Because traditional ECG
analysis approaches are time-consuming and tedious, various automated
solutions have arisen in recent years (Benjamin et al., 2017). With the
advent of contemporary signal processing, data mining, and machine learning
techniques, the ECG’s diagnostic power has grown considerably.

Numerous algorithms have been proposed to detect and classify ECG signals.
Some methods work in the time domain, while others work in the frequency
domain. Based on these algorithms, various distinctive features have been
defined for ECG classification and for recognizing the different
pathological classes from the beats obtained from the ECG signal.


7.2 Related Work 


According to the literature, ECG signal classification is an important
component of the clinical diagnosis of heart disease. The biggest problem
in using the ECG to diagnose heart disease is that a normal ECG varies from
individual to individual, and one disease may show different signs in the
ECG signals of different patients. Conversely, two different disorders may
have similar effects on a normal ECG signal. These issues make identifying
heart disease challenging. A pattern classifier can therefore be used to
improve the analysis of a patient’s ECG for MI diagnosis. This study
provides a frugal solution: a framework for ECG classification posed as a
binary classification problem with two classes, normal and myocardial
infarction (Rajkumar et al., 2019).

ECG data can be classified at two levels: the ECG signal as a whole and
individual ECG beats. A single cardiac cycle contains several waves, namely
the P, Q, R, S, T, and U waves, which together depict one ECG beat. A single
ECG signal is formed by a sequence of thousands of ECG beats.

Pre-processing, feature extraction, and classification are all important
steps in the ECG classification process. ECG signals can contain many types
of noise (for example, baseline wander), which is a key factor influencing
feature extraction and classification.
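As an illustration of the pre-processing step, baseline wander can be reduced by estimating the slow-moving trend of the signal and subtracting it. The sketch below uses a centered moving average for that estimate; this is an assumption chosen for simplicity, not the filter used in this work, and the function names and window length are hypothetical.

```python
def moving_average(signal, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def remove_baseline_wander(signal, window):
    """Subtract the slow-moving trend estimated by a moving average."""
    baseline = moving_average(signal, window)
    return [x - b for x, b in zip(signal, baseline)]


if __name__ == "__main__":
    # A hypothetical recording: a flat trace with a slow upward drift.
    drifting = [0.1 * i for i in range(100)]
    corrected = remove_baseline_wander(drifting, window=11)
    print(max(abs(v) for v in corrected[5:-5]))
```

For a real ECG, the window would need to span more than one beat so that the P-QRS-T morphology is not averaged into the baseline estimate; high-pass or median filters are common alternatives.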

The ECG provides meaningful data that reveals both the occurrence of MI and
its location. MI characteristics include ST-segment elevation, T-wave
inversion, and the appearance of abnormal Q waves. These are mostly inferred
by classifying feature vectors. The major issues in ECG classification are
variability among ECG features, a lack of standardization of ECG features,
individuality of ECG features, a lack of optimal classification rules,
variability in patients’ ECG waveforms, and, finally, the selection of the
most appropriate classifier. According to Shweta H. Jambukia (2015), the
following descriptions of these issues will help new researchers recognize
the difficulties associated with ECG classification.


1. Slight variation of ECG features: The boundaries of ECG patterns and
their amplitude ranges are not standardized or fixed. The feature
extraction approach selects only some of the ECG features, and its
reliability depends primarily on the features found. On big datasets, a
small deviation in these crucial features can lead to misclassification.

2. Diversity of ECG features: A person’s heart rate changes with biological
and physiological variables. An increase in heart rate might be caused by
stress, exercise, excitement, or other work-related activities. The RR
interval, QT interval, and PR interval all change as the heart rate
changes. These features must be carefully adjusted so that the influence
of the variable heart rate is removed.

3. Uniqueness of ECG patterns: The uniqueness of ECG patterns denotes the
potential for intraclass similarity and interclass diversity among the
patterns acquired from ECG data. It is useful for showing how far the ECG
patterns can be scaled in a large enough dataset.

4. Absence of optimal classification rules: No optimal set of rules for ECG
classification exists.

5. Patients’ ECG waveforms may differ: Different patients’ ECGs may have
different signal slopes, amplitudes, and timing, resulting in a shift in
the ECG waveforms. The ECG signal must therefore be treated and classified
with caution.

6. Changes in heartbeat in a single ECG: Thousands of beats can be found in
a single ECG, and these beats may belong to distinct types (i.e.,
myocardial infarction, arrhythmia, etc.). The classification model should
be trained in such a way that only a few minor errors occur on the test
dataset.

7. Selection of the best classifier: Finding a classifier that can
categorize MI in real time can be difficult because classification
accuracy is affected by a variety of factors.


This work proposes the features that best support a model for clustering
the P-QRS-T waves contained in ECG signals in order to detect myocardial
infarction.

The framework of the ECG signal classification system consists of pre- 
processing, feature extraction, feature selection, and classification. 


7.3 The Normal ECG Signal 


A normal ECG signal is made up of waves, segments, intervals, and
complexes, which are described below, with a graphical representation in
Figure 7.1.


Figure 7.1 The human ECG signal during cardiac cycle.

• Wave: A wave is a positive or negative deflection that begins at the
  baseline and represents a specific electrical activity. The ECG waves are
  the P wave, Q wave, R wave, S wave, T wave, and the U wave.

• Interval: The span of time that elapses between two separate ECG events.
  The PR interval, QRS interval (also known as QRS duration), RR interval,
  and QT interval are all routinely measured on an electrocardiogram.

• Segment: The span between two separate ECG regions with the same
  amplitude level (neither negative nor positive). On an ECG, there are
  three segments: the PR segment, the TP segment, and the ST segment.

• Complex: A group of several waves. The QRS complex is the only important
  complex shown on an ECG.

• Point: On an ECG, there is just one point, the J point, which is where
  the QRS complex ends and the ST segment begins.

• P wave: The P wave depicts the depolarization of the atria. During normal
  atrial depolarization, the electrical vector travels from the SA node
  toward the AV node, spreading from the right atrium to the left atrium.
  On the ECG, this appears as the P wave. It has a duration of about 80 ms.

• QRS complex: The QRS complex reflects the depolarization spreading through
  the right and left ventricles. Because the ventricles have a large muscle
  mass compared to the atria, the QRS complex has a significantly larger
  amplitude than the P wave. A typical QRS complex lasts 80-100 ms. By
  convention, any combination of these waves is referred to as a QRS
  complex.

• T wave: The T wave signals ventricular repolarization. The absolute
  refractory period is the interval between the beginning of the QRS
  complex and the peak of the T wave. The second half of the T wave is
  defined as the relative refractory period. The T wave is typically
  positive, with a duration of 150-350 ms.

• RR interval: The RR interval is the time between an R wave and the
  following R wave. The usual resting heart rate is between 60 and 100
  beats per minute. The RR interval typically lasts between 0.6 and 1.2 s.

• PR interval: The PR interval is measured from the onset of the P wave to
  the onset of the QRS complex. It represents the time an electrical
  impulse takes to travel from the sinus node through the AV node to the
  ventricles. It is a good estimate of AV node function and has a duration
  of 120-200 ms.

• PR segment: The PR segment connects the P wave and the QRS complex. The
  impulse vector travels from the AV node through the bundle branches and
  eventually to the fibers in the ventricular walls. This electrical
  activity does not produce a contraction directly; it merely travels
  toward the ventricles. It therefore appears flat on the ECG, with a
  duration of approximately 50-120 ms.

• J point: The J point is the point at which the QRS complex ends and the
  ST segment begins. It is commonly used to assess the degree of ST
  elevation or depression.

• ST segment: The ST segment connects the QRS complex and the T wave. It
  represents the period during which the ventricles are depolarized. The ST
  amplitude is determined from the deviation at the J point relative to the
  isoelectric line (PQ or TP segment). From a clinical aspect, positive or
  negative ST-segment deviations greater than 1-2 mm indicate the presence
  of myocardial ischemia.

• ST interval: The ST interval is measured from the J point to the end of
  the T wave. It lasts about 320 ms.

• QT interval: The QT interval is measured from the start of the QRS
  complex to the end of the T wave. A prolonged QT interval is a risk
  factor for ventricular arrhythmia, which can lead to sudden death.
  Because it varies with heart rate, it is corrected for clinical studies
  as QTc = QT interval / sqrt(RR interval). It can last up to 420 ms for a
  heart rate of 60 beats per minute.
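The heart-rate correction of the QT interval described above, commonly known as Bazett's formula, can be sketched as follows. The function names and the example values are hypothetical; the RR interval is assumed to be given in seconds and the QT interval in milliseconds.

```python
import math


def heart_rate_bpm(rr_seconds):
    """Heart rate in beats per minute from the RR interval in seconds."""
    return 60.0 / rr_seconds


def qtc_bazett(qt_ms, rr_seconds):
    """Corrected QT interval (ms): QT divided by the square root of RR."""
    return qt_ms / math.sqrt(rr_seconds)


if __name__ == "__main__":
    # At 60 beats per minute the RR interval is 1 s, so QTc equals QT.
    print(heart_rate_bpm(1.0), qtc_bazett(420.0, 1.0))
    # At a faster rate the same measured QT yields a larger corrected value.
    print(heart_rate_bpm(0.64), qtc_bazett(400.0, 0.64))
```

The first call shows why the 420 ms figure is quoted for 60 beats per minute: with an RR interval of 1 s the correction leaves the measured QT unchanged.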
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7.3.1 ECG Features 


The ECG wave has a variety of characteristics, each of which is useful
for identifying and documenting the current condition of the heart. A
single ECG waveform recorded from lead II of a normal subject has a
distinct morphology consisting of three waves: the P wave, the QRS
complex, and the T wave. Waves, complexes, segments, and intervals are all
components of the standard ECG signal, which is examined by plotting
voltage on the vertical axis against time on the horizontal axis. A single
ECG waveform begins and ends at the isoelectric line. 

A segment is a flat, straight, isoelectric line. A group of waveforms is
collectively referred to as a complex. An interval is a waveform or
complex together with a segment. ECG tracings above the isoelectric line
are defined as positive deflections, whilst tracings below the isoelectric
line are defined as negative deflections, as stated by Maharaj et al.
(2014). 


7.3.2 12-Lead ECG System 


A typical ECG has 12 leads, according to Rajesh et al. (2021). Six of
them are attached to the individual’s arms and/or legs and are therefore
referred to as “limb leads.” The remaining six are positioned on the torso
(precordium) and are referred to as “pre-cordial leads.” The six limb
leads are labeled lead I, II, III, aVL, aVR, and aVF. The letter “a”
stands for “augmented” because these leads are derived from a combination
of leads I, II, and III. The six pre-cordial leads are labeled V1, V2, V3,
V4, V5, and V6. Together they form a normal 12-lead ECG trace. 

A lead is a pair of electrodes placed on the human body at precise
anatomical locations in order to record the electrical activity of the
heart. Each lead has two electrodes: a negative (-) and a positive (+).
The conventional 12-lead ECG system includes three bipolar leads, three
augmented unipolar leads, and six chest leads known as pre-cordial leads,
as shown in Figure 1.5 and Table 7.1. The polarity of the electrodes can
be changed using the “lead selector” on the ECG equipment, which makes it
possible to obtain distinct lead selections without physically moving the
lead wires or electrodes. The following is a summary of ECG features as
defined by Xue et al. (2020). 

Table 7.1 Standard 12-lead ECG system description 

Standard leads     Limb leads         Chest leads 
(bipolar leads)    (unipolar leads)   (unipolar leads) 
Lead I             aVR                V1 
Lead II            aVL                V2 
Lead III           aVF                V3 
                                      V4 
                                      V5 
                                      V6 


Following are the detailed descriptions of the 12-lead system: 

✓ Bipolar leads record the potential difference between a positive and a
  negative electrode. 
✓ Unipolar leads have a single probing electrode for measuring electrical
  potential. 
✓ The unipolar limb leads are aVR, aVL, and aVF. 
✓ The bipolar limb leads are the Einthoven leads: 
✓ Lead III records potentials between the left arm and the left leg. 
✓ Lead II records potentials between the right arm and the left leg. 
✓ Lead I records potentials between the left and right arms. 


The chest leads are positioned as follows: 

• V6: On the midaxillary line. 
• V5: On the anterior axillary line. 
• V4: In the fifth intercostal space, on the midclavicular line. 
• V3: Midway between the V2 and V4 electrodes. 
• V2: In the fourth intercostal space, at the left sternal edge. 
• V1: In the fourth intercostal space, at the right sternal edge. 


Only a few electrodes are required to study the heart rhythm; typically, 
ten electrodes are utilized for acquisition when multiple waveforms and 
morphological information are required. 


7.4 Proposed Methodology 


The proposed ECG classification framework is based on deep learning
techniques like native CNN, LSTM, and the proposed EDN algorithms, as
shown in the overall architecture of Figure 7.2. 

Figure 7.2 Overall framework of the proposed ECG classification system. 


When there is noise in the signal, it becomes distorted. As a result, the 
accuracy of the signal’s feature selection and classification degrades. 


7.4.1 Phase I: Pre-Processing 


Noise distorts the ECG signal and reduces the accuracy of feature
selection and classification. It must therefore be removed from the ECG
signal during the pre-processing stage. 


ECG Signal Filtering 


Filters on ECGs are necessary for removing artifacts; however, their incorrect 
use may result in misdiagnosis. 


A. Low-Pass Filters 


Low-pass filters reduce high-frequency noise in ECG signals. A noisy
recording can be regarded as a smooth ECG signal mixed with high-frequency
noise [9]. 


B. High-Pass Filter 


High-pass filters reduce low-frequency noise in ECG signals. 


C. Derivative-Based Filters 


Derivative filters are frequently used for slope computations, because
the derivatives of the recorded signal carry significant information. 
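
The three filter types can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a moving average stands in for the low-pass filter, subtracting a wide moving average stands in for the high-pass filter, and a central difference stands in for the derivative filter. The chapter does not specify its filter designs, so the function names and window widths here are hypothetical.

```python
def moving_average(x, width):
    """Low-pass sketch: replace each sample by the mean of a centred window."""
    half = width // 2
    out = []
    for n in range(len(x)):
        window = x[max(0, n - half): n + half + 1]
        out.append(sum(window) / len(window))
    return out

def high_pass(x, width):
    """High-pass sketch: subtract the slowly varying baseline (wide moving average)."""
    baseline = moving_average(x, width)
    return [a - b for a, b in zip(x, baseline)]

def derivative(x, fs):
    """Derivative sketch: central-difference slope in signal units per second."""
    return [(x[n + 1] - x[n - 1]) * fs / 2.0 for n in range(1, len(x) - 1)]

# A constant offset (pure baseline) is removed entirely by the high-pass step,
# and a ramp has a constant slope under the derivative step.
print(high_pass([5.0] * 7, 5))
print(derivative([0.0, 1.0, 2.0, 3.0, 4.0], fs=2.0))
```

In practice the cut-off frequencies must be chosen with care, since, as noted above, incorrect filtering can distort diagnostically relevant parts of the ECG.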
Figure 7.3 Comparison of original signal and filtered signal. 


7.4.2 Phase II: Feature Extraction 


The frequency domain and temporal domain of the ECG signal have distinct
characteristics. To extract features, first-order and higher-order
statistics are successfully used. First-order statistical features like
mean and variance are retrieved, as well as higher-order statistical
features like skewness and kurtosis. 

Mean         Variance       Skewness    Kurtosis 
72.763190    2841.590151    5.441525    36.595744 
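
A minimal sketch of these four statistics is given below. It assumes population (divide-by-n) moments and non-excess kurtosis; the chapter does not state which convention produced the values above, so the numbers from this sketch may differ from a library that uses another convention. The function name is hypothetical.

```python
def statistical_features(x):
    """Mean, variance, skewness and kurtosis of a list of samples."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    variance = sum(d * d for d in dev) / n                 # population variance
    std = variance ** 0.5
    skewness = sum(d ** 3 for d in dev) / (n * std ** 3)   # third standardized moment
    kurtosis = sum(d ** 4 for d in dev) / (n * std ** 4)   # fourth standardized moment
    return {"mean": mean, "variance": variance,
            "skewness": skewness, "kurtosis": kurtosis}

# A symmetric toy segment: skewness is zero because the samples mirror the mean.
print(statistical_features([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The large positive skewness and kurtosis reported above are what one would expect from an ECG segment dominated by a few tall R peaks.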


7.4.3 Phase III: Feature Selection 


Multiple ECG beats, each containing P waves, QRS complexes, and T waves,
make up an ECG signal. Intervals such as “PR,” “RR,” “QRS,” “ST,” and
“QT,” as well as peaks such as P, Q, R, S, and T, are extracted from the
ECG signals together with their usual amplitude or duration values. The
different forms of ECG features are segments, intervals, and peaks. 
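
As an illustration of peak and interval features, the sketch below locates R peaks with a simple amplitude threshold and derives RR intervals from them. This is a deliberately simplified stand-in and not the detector used in this chapter: the threshold, the 200 ms refractory period, and both function names are assumptions.

```python
def detect_r_peaks(signal, fs, threshold, refractory_s=0.2):
    """Return sample indices of local maxima above `threshold`.

    A peak is skipped if it follows the previously accepted peak by less
    than `refractory_s` seconds (a simplification: the first peak wins).
    """
    min_gap = int(refractory_s * fs)
    peaks = []
    for n in range(1, len(signal) - 1):
        is_local_max = signal[n - 1] < signal[n] >= signal[n + 1]
        if is_local_max and signal[n] >= threshold:
            if not peaks or n - peaks[-1] >= min_gap:
                peaks.append(n)
    return peaks

def rr_intervals_ms(peaks, fs):
    """RR intervals in milliseconds between consecutive R peaks."""
    return [1000.0 * (b - a) / fs for a, b in zip(peaks, peaks[1:])]

# Synthetic trace sampled at 100 Hz with one spike per second.
fs = 100
trace = [0.0] * 300
for p in (50, 150, 250):
    trace[p] = 1.0
peaks = detect_r_peaks(trace, fs, threshold=0.5)
print(peaks, rr_intervals_ms(peaks, fs))
```

Other intervals such as QT or ST can be measured in the same spirit once the corresponding onset and offset points have been located.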


Figure 7.5 ST detection (raw signal plotted against time in seconds). 

Figure 7.6 QRS detection. 

Figure 7.7 R peak detection. 


7.5 ECG Classification Using Deep Learning Techniques 


The major purpose of this study is to improve an automated model that can
be used on mobile healthcare devices to diagnose MI from ECG data. With
conventional machine learning methods, the features were typically
extracted from raw ECG data using traditional techniques or inferred
using advanced machine learning models. According to Chen et al. (2020),
deep learning methods extract hidden information from raw data, which is
useful for data classification. This research effort provides an EDN
model suited to this purpose. 

Deep neural networks have progressed to the point where they can now 
accurately analyze a signal. The DNN algorithm and the deep learning 
approach are useful for classifying data and determining if the data has MI or 
is healthy. CNN and LSTM are examples of classifiers. 


7.5.1 CNN 


CNNs are known as feature learners and have a large capacity for
automatically extracting appropriate features from raw input data. A CNN
is composed of two main components, each of which serves a distinct
purpose. According to Baloglu et al. (2019), the first component is the
feature extractor, which is responsible for extracting the features
automatically. The second component is the classifier, also known as a
fully connected network or multi-layer perceptron (MLP). 

The fully connected part classifies the data based on the learned
features produced by the first part. The feature extraction part contains
two common layers: a convolutional layer and a pooling layer. The
convolutional layer extracts feature maps from the previous layer. It is
composed of several convolution kernels, also known as filters; a bias is
added to the output of each kernel, and the result is passed through the
activation function to produce a feature map for the following layer. The
pooling layer is one of the most common sub-layers in the feature
extraction part. 

Pooling layers are of three types: max, min, and average pooling. The
pooling layer reduces the resolution of the feature maps. The suggested
model uses max pooling, which takes the largest value found in a set of
neighboring inputs. The two parts of the network also differ in
complexity: compared with the fully connected layers, the feature
extraction part, which forms the primary layers, performs more of the
computation, since it covers the feature extraction and feature selection
procedures. 

The CNN structure shares several properties with the ANN structure of
input layer, hidden layer, and output layer (Liu et al., 2018). A CNN is
sometimes described as a developed form of the ANN since, unlike a plain
NN, it is translation and shift invariant. CNNs are frequently made up of
a variety of layers, including input, convolution, average-pooling,
max-pooling, dropout, and softmax layers, among others, each of which
plays a vital part in the model. The feature extraction part is critical
for automatically obtaining useful characteristics from ECG signals, while
the classification part is responsible for accurately classifying signals
using the extracted features. 


As stated above, the fundamental components for feature extraction are
two kinds of layers: convolutional layers and downsampling layers.
Convolutional layers, commonly known as C-layers, act as a type of fuzzy
filter that enhances the characteristics of the original signal while
simultaneously reducing noise. Convolution is performed between the
feature vectors of the previous layer and the convolution kernels of the
current layer. Finally, according to Li et al. (2017), the activation
function is applied to the result of the convolution to produce the
output. Eqn (7.1) describes the output of the convolutional layer: 


$$x_j^l = f\Big(\sum_{i \in M_j} x_i^{l-1} * w_{ij}^l + b_j^l\Big) \tag{7.1}$$

where $x_j^l$ is the characteristic vector corresponding to the jth
convolution kernel of the lth layer, $M_j$ is the receptive field of the
current neuron, $w_{ij}^l$ and $b_j^l$ are the weight and bias
coefficients corresponding to the jth convolution kernel of the lth layer
(the kernel is written k in eqn (7.9)), and f is a nonlinear function
(2018). 
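
The convolutional layer of eqn (7.1) can be sketched for one-dimensional signals as follows. The sketch assumes unpadded sliding with stride 1 and, as in most deep learning software, does not flip the kernel; the data layout (`kernels[j][i]` links input vector i to output vector j) and the function names are choices made for this illustration only.

```python
import math

def conv1d_valid(x, kernel):
    """Slide the kernel over x without padding; output length is len(x) - len(kernel) + 1."""
    k = len(kernel)
    return [sum(x[n + m] * kernel[m] for m in range(k))
            for n in range(len(x) - k + 1)]

def conv_layer(inputs, kernels, biases, f=math.tanh):
    """Each output vector j sums the convolutions of all inputs, adds a bias, applies f."""
    outputs = []
    for j, bias in enumerate(biases):
        maps = [conv1d_valid(x_i, kernels[j][i]) for i, x_i in enumerate(inputs)]
        summed = [sum(values) + bias for values in zip(*maps)]
        outputs.append([f(v) for v in summed])
    return outputs

# Two input vectors feeding one output vector, with an identity activation
# so that the arithmetic is easy to follow.
out = conv_layer([[1, 2, 3], [1, 1, 1]], [[[1, 0], [0, 2]]], [1], f=lambda v: v)
print(out)
```

With a 130-sample input and a kernel of length 7, `conv1d_valid` returns 124 samples, which matches the C1 layer size quoted later in this section.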


At the same time, pooling is used to retain features while providing
invariance to displacement and scaling. The downsampling layer performs
further feature extraction while reducing the spatial resolution between
hidden layers; it is defined by the following equation: 


$$x_j^l = f\big(\beta_j^l \,\mathrm{down}(x_j^{l-1}) + b_j^l\big) \tag{7.2}$$


Here, down() denotes the downsampling function, $\beta_j^l$ denotes the
weighting coefficient, and $b_j^l$ denotes the bias coefficient. In
addition to the C-layers and S-layers (downsampling layers), the network
has input and output layers. Before the network model is trained, the ECG
signal is split into segments that serve as input data, and the target
output vector is specified. The error between the network output and the
provided target output vector is computed using eqn (7.3). 


$$E = \frac{1}{2}\sum_{k=0}^{n-1} (d_k - y_k)^2 \tag{7.3}$$

where E stands for the total error function, $y_k$ stands for the output
vector, and $d_k$ stands for the target vector. If training has
converged, it is complete and the weights and thresholds are saved; the
weights are then held constant and the classifier is built. Otherwise,
the process is repeated. Convolutional neural networks were originally
designed to deal with two-dimensional data, whereas here they are applied
to one-dimensional data, so the structure of the CNN model must be
adapted accordingly. Figure 7.8 depicts the CNN model used for ECG
classification. 

Figure 7.8 CNN architecture for ECG classification. 

The model consists of an input layer, two convolutional layers, two
downsampling layers, a fully connected layer, and an output layer.
According to Li et al. (2000), the input of each neuron is connected to
the output of the previous layer, which is primarily used to extract
local information. The input is a 130-sample-point segment of ECG data.
The convolutional layer C1 has 18 convolution kernels, each with a length
of 7 sampling points, and produces 18 feature vectors of 124 sampling
points each. 

The feature vectors of the C1 layer are pooled in the S1 layer, yielding
62 sampling points each. The C2 layer has 18 convolution kernels with a
length of 7 sampling points, and its output contains 324 feature vectors
of 56 sampling points. When these feature vectors are pooled again in the
S2 layer, their length is reduced to 28 sampling points; they are then
sent to the output layer, which produces the classification result. 
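
The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes unpadded convolution with stride 1 and non-overlapping pooling with a window of 2; neither is stated explicitly in the text, but both are consistent with the quoted numbers. The helper names are hypothetical.

```python
def conv_output_length(n_in: int, kernel_len: int) -> int:
    """Length after an unpadded, stride-1 convolution."""
    return n_in - kernel_len + 1

def pool_output_length(n_in: int, pool: int = 2) -> int:
    """Length after non-overlapping pooling with window `pool`."""
    return n_in // pool

c1 = conv_output_length(130, 7)  # C1: 130 -> 124
s1 = pool_output_length(c1)      # S1: 124 -> 62
c2 = conv_output_length(s1, 7)   # C2: 62 -> 56
s2 = pool_output_length(c2)      # S2: 56 -> 28
print(c1, s1, c2, s2)            # prints: 124 62 56 28
print(18 * 18)                   # 18 inputs x 18 kernels = 324 feature vectors
```

The figure of 324 feature vectors after C2 corresponds to each of the 18 C2 kernels being applied to each of the 18 vectors coming from S1.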


7.5.2 LSTM 


RNN is important for assessing the underlying representation of a time- 
varying signal, and it is one of the network topologies developed to address 
the sequential problem, also known as sequence classification. 

Hochreiter and Schmidhuber introduced a significant upgrade to RNN in 
1997, which they dubbed the long short-term memory (LSTM). As a result, 
numerous LSTM applications have lately been improved. 

LSTM is one of the most widely used RNN architectures. An LSTM network is
built from memory blocks that contain memory cells together with gate
units. Multiplicative input gate units protect the memory contents
against the negative effects of unrelated inputs (Zhang et al., 2019).
According to Kumar et al. (2021), the input gate controls data flow into
the memory cell, while the output gate controls data flow from the memory
cell to other LSTM blocks. 

The data flow in an LSTM cell is controlled by the input gate (denoted as
$i_t$), forget gate (denoted as $f_t$), and output gate (denoted as
$o_t$). The hidden variable and cell state of the LSTM at time-stamp t
are denoted as $h_t$ and $c_t$, respectively; $h_t$ also represents the
output of an LSTM cell, and $x_t$ represents the input vector to an LSTM
cell at time-stamp t. A vanilla LSTM uses $x_{t+L}$ as the last input to
an LSTM cell, where L denotes the number of segments in an ECG sequence
used for recognition. This is depicted in Figure 7.9. 

However, it was subsequently found that the output at the final
time-stamp may not precisely describe all of the previously acquired
data. Furthermore, during back-propagation the error attributed to the
first time-stamps may become negligible (Shadmand et al., 2016). Rather
than using only the final hidden variable as the output, this work uses
the output at each time-stamp. At each time-stamp, a dropout layer
receives the output of the LSTM cell and passes it on. The output of the
dropout layer is then routed through a fully connected layer. 

Figure 7.9 Architecture of LSTM for ECG classification. 

In the final part, a softmax function is employed to calculate the
likelihood of each class for an ECG sequence. The values at each time
interval are stored in the cell unit, and the remaining gate units
control the flow of data into and out of the unit. In the memory block
structure, the forget gate is controlled by a simple one-layer neural
network. Its activation is calculated using the following equation: 


$$f_t = \sigma\big(W[x_t, h_{t-1}, C_{t-1}] + b_f\big) \tag{7.4}$$


in which $x_t$ is the input sequence, $C_{t-1}$ is the previous LSTM
block memory, and $h_{t-1}$ is the previous block output. The bias vector
is denoted by $b_f$, the logistic sigmoid function by $\sigma$, and the
separate weight vectors used for each input by W (Darmawahyni et al.,
2019). The input gate is a unit that constructs new memory from the
previous memory block by using a simple NN with a tanh activation
function. These operations are evaluated using eqns (7.5) and (7.6): 


$$i_t = \sigma\big(W[x_t, h_{t-1}, C_{t-1}] + b_i\big) \tag{7.5}$$


$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tanh\big(W[x_t, h_{t-1}, C_{t-1}] + b_c\big) \tag{7.6}$$


The output gate produces the output of the present LSTM block and is
computed using the following equations: 


$$o_t = \sigma\big(W[x_t, h_{t-1}, C_t] + b_o\big) \tag{7.7}$$

$$h_t = o_t \cdot \tanh(C_t) \tag{7.8}$$


where $b_i$ and $b_o$ are the bias vectors of the input and output gates.
These units are connected to one another, as shown in Figure 7.9,
allowing the data to cycle between adjacent time steps and providing an
inner feedback state, which lets the network capture the temporal
features in the provided data. 

To summarize, the forget gate in the memory block structure is controlled
by a simple one-layer neural network, where $x_t$ represents the input
sequence, $h_{t-1}$ represents the previous block output, $C_{t-1}$
represents the previous LSTM block memory, and $b_f$ represents the bias
vector. The letter “W” signifies the independent weight vectors for each
input, and $\sigma$ is the logistic sigmoid function. The sigmoid
activation, which is the output of the forget gate, is applied to the
preceding memory block by element-wise multiplication. This determines
how much of the previous memory block is retained. 

If the activation output vector has values close to zero, the preceding
memory is forgotten. The input gate, on the other hand, is the section
where new memory is formed from the preceding memory block by a basic NN
with the tanh activation function. Finally, the output gate uses these
results to generate the output of the current LSTM block. 
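
The gate computations of eqns (7.4) to (7.8) can be sketched as a single time step in plain Python. Two points are assumptions of this sketch and not statements about the authors' implementation: each gate is given its own weight matrix (the text writes a single W), and the parameter dictionary keys such as `"Wf"` are hypothetical names.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W v + b for a weight matrix W (list of rows), vector v and bias b."""
    return [sum(w * a for w, a in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; every gate sees the concatenation [x_t, h_{t-1}, C]."""
    z = x_t + h_prev + c_prev                                    # list concatenation
    f_t = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]      # forget gate (7.4)
    i_t = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]      # input gate (7.5)
    g_t = [math.tanh(a) for a in affine(p["Wc"], z, p["bc"])]    # new memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]  # (7.6)
    z_out = x_t + h_prev + c_t                                   # uses the updated cell
    o_t = [sigmoid(a) for a in affine(p["Wo"], z_out, p["bo"])]  # output gate (7.7)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]           # block output (7.8)
    return h_t, c_t

# One input feature and one memory cell; all weights zero so gates equal 0.5,
# except a strongly negative forget bias that wipes the previous memory.
zeros = [[0.0, 0.0, 0.0]]
params = {"Wf": zeros, "bf": [-50.0], "Wi": zeros, "bi": [0.0],
          "Wc": zeros, "bc": [0.0], "Wo": zeros, "bo": [0.0]}
print(lstm_step([1.0], [0.0], [2.0], params))
```

The example illustrates the statement above: when the forget gate output is close to zero, the previous cell content contributes almost nothing to the new cell state.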


7.5.3 Enhanced Deep Neural Network (EDN) 


MI is distinguished by irregular and unpredictable beats that can occur
singly or in groups. As a result, the suggested network should be able to
handle signals of different lengths. The proposed EDN framework addresses
these characteristics well. 

The LSTM unit combines a “memory” cell with a gating mechanism comprising
three nonlinear gates: an input gate, an output gate, and a forget gate.
The purpose of the gates is to control the flow of signals into and out
of the cell in order to ensure successful RNN training and to regulate
long-term dependence (Feng et al., 2019). Since its inception, the LSTM
unit has undergone numerous modifications to improve performance, for
example by increasing the number of components in the LSTM architecture. 

The suggested enhanced deep neural network model is therefore built
primarily on the CNN and LSTM approaches, with extra components
incorporated into the LSTM design to improve performance. The input layer
of this model receives both healthy and MI data. The data pass through
hierarchically organized EDN and dropout layers, which transform them
into feature maps of various sizes. Class prediction is then performed by
the dense layer. The dropout method prevents the model from overfitting
during training. 

According to Manimekalai et al. (2020), the model inspects the entire
training dataset at each epoch. If the number of epochs is large enough,
a model can memorize the training data. Vector operations such as matrix
multiplication and gradient descent on huge matrices of data benefit from
being processed in parallel on a GPU. In this approach, the bias of h is
added to the cell state vector, which helps to improve performance.
Because the output gate is regarded as less important than the forget
gate and the input gate, the recommended approach adjusts the hidden
state vector by applying point-wise Hadamard multiplication to the prior
output gate parameter and the previous cell state vector. Because of this
parallelism, EDN has a shorter running time. Each vector in the hidden
state implicitly depends on the previous cell state unit. Convolutional
neural networks (CNN) are used to create an effective neighborhood
identification process. 

The CNN and LSTM algorithms are used in the proposed enhanced deep 
neural network (EDN) framework. The number of components in the LSTM 
architecture may be considered to improve performance. The input layer of 
this model receives both healthy and MI information. They are transformed 
into feature maps of varied sizes after passing through EDN and dropout 
layers in a hierarchical order. The dense layer then provides class prediction 
automatically. During training, the dropout method protects the model from 
overfitting. At each epoch, the model examines the complete training dataset. 
If the epoch number is large enough, a model can memorize the training 
data. Vector operations on large data matrices, such as matrix multiplica- 
tion and gradient descent, are performed in parallel with GPU support. In 
this approach, the bias of h is added to the cell state vector to improve 
performance. The proposed EDN framework is depicted in Figure 7.10. 

Because the output gate is less critical than the input and forget gates, 
the proposed technique updated the hidden state vector by adding point-wise 
Hadamard multiplication among the prior output gate parameter and previous 
cell state vector. Because of parallelism, EDN minimizes the time it takes 
for a process to run. It has the capability of processing data in such a way 
that each vector in the hidden state is implicitly dependent on a preceding 
cell state unit. The proposed model employs CNN to improve the efficiency
of the neighborhood identification process. 

The EDN model reduces training time when compared to an LSTM model (Fu et
al., 2020). The outputs of the stateful LSTM units in the model are
passed to a fully connected softmax layer, resulting in a probability
distribution over the output classes. The LSTM and CNN models were both
considerably more successful on ECG signals. 

Figure 7.10 Framework for proposed EDN. 

To increase the performance of the suggested methodology, the bias of h
was added to the cell state vector. Because the output gate was less
important than the input and forget gates, its conventional role was
removed; instead, the hidden state vector is computed by combining the
prior output gate parameter and the preceding cell state vector using
point-wise Hadamard multiplication. 


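$$x_j^l = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big) \tag{7.9}$$

$$i_t = \sigma\big(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} C_{t-1} + b_i\big) \tag{7.10}$$

$$f_t = \sigma\big(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} C_{t-1} + b_f\big) \tag{7.11}$$

$$o_t = \sigma\big(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} C_{t-1} + b_o\big) \tag{7.12}$$

$$C_t = \tanh\big(W_c x_t + r_t (W_c h_{t-1} + b_{cn}) + b_n\big) \tag{7.13}$$

$$h_t = (1 - o_t) * C_t + o_t * C_{t-1} \tag{7.14}$$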

where $x_t$ is the m-dimensional input vector, and $i_t$, $f_t$, and
$o_t$ are the input gate, forget gate, and output gate at time t, each an
n-dimensional vector obtained by applying the sigmoid function. $C_t$ is
the cell state vector, the n-dimensional activation unit for the cell
state, computed with the tanh activation. $h_t$ is the hidden state
vector, which is computed with the point-wise Hadamard multiplication
operator. Each quantity is determined by the corresponding equation among
(7.9) to (7.14). 
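
A scalar sketch of eqns (7.10) to (7.14) is shown below, purely to make the data flow concrete. Several points are assumptions of this illustration: the quantity $r_t$ in eqn (7.13) is not defined in the text and is therefore passed in by the caller, the bias names `bcn` and `bn` follow one reading of the printed equation, and all parameter names are hypothetical. As the equations are printed, $i_t$ and $f_t$ do not enter eqns (7.13) or (7.14), so the sketch simply returns them.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def edn_step(x_t, h_prev, c_prev, r_t, p):
    """One scalar update following eqns (7.10)-(7.14) as printed."""
    i_t = sigmoid(p["Wxi"] * x_t + p["Whi"] * h_prev + p["Wci"] * c_prev + p["bi"])  # (7.10)
    f_t = sigmoid(p["Wxf"] * x_t + p["Whf"] * h_prev + p["Wcf"] * c_prev + p["bf"])  # (7.11)
    o_t = sigmoid(p["Wxo"] * x_t + p["Who"] * h_prev + p["Wco"] * c_prev + p["bo"])  # (7.12)
    # (7.13): cell state; r_t is supplied by the caller (assumption).
    c_t = math.tanh(p["Wc"] * x_t + r_t * (p["Wc"] * h_prev + p["bcn"]) + p["bn"])
    # (7.14): the output gate blends the new and the previous cell state.
    h_t = (1.0 - o_t) * c_t + o_t * c_prev
    return h_t, c_t, (i_t, f_t, o_t)

names = ("Wxi", "Whi", "Wci", "bi", "Wxf", "Whf", "Wcf", "bf",
         "Wxo", "Who", "Wco", "bo", "Wc", "bcn", "bn")
params = {name: 0.0 for name in names}
params["bo"] = 10.0  # output gate close to 1: h_t stays near the previous cell state
h_t, c_t, gates = edn_step(x_t=1.0, h_prev=0.0, c_prev=0.8, r_t=1.0, p=params)
print(h_t, c_t)
```

The example shows the role of the Hadamard blend in eqn (7.14): an output gate near 1 carries the previous cell state forward, while a gate near 0 passes the new cell state through.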

It may process input sequentially, which allows each vector in the hidden 
state to be implicitly dependent on the previous cell state unit. EDN employs 
convolutional neural networks to improve the efficiency of the neighborhood 
identification process. 

Consequently, deep learning models require no extraction of hand-crafted
features, which makes them simple to use. The properties of these two
algorithms were therefore combined in this research effort to perform the
diagnosis of myocardial infarction. 

Figure 7.11 Architecture of EDN for ECG classification. 

Algorithm EDN: Enhanced Deep Neural Network 

Input: $x = (x_1, x_2, x_3, \ldots, x_t)$, a chain of independent variables 
       D: the number of memory blocks 
       $S_j$: the number of cells in block j 
Step 1: Read the data, then calculate the standard deviation and separate the data. 
Step 2: Calculate the total number of independent variables. 
        Mapsize = fix(log3(data size)) - 1 
Step 3: Create CNN layers by combining input layers and subsampling layers. 
        $x_j^l = f(W_j^l x_j^{l-1} + b_j^l)$ 
        $x_j^l = f\big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\big)$ 
Step 4: Send the CNN output layer units as input vectors to the LSTM unit's
        input gate, forget gate, and output gate. 
Step 5: For each memory block j = 1 to D, compute the input, forget, and
        output gates. 
        Evaluate the input gate:  $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} C_{t-1} + b_i)$ 
        Evaluate the forget gate: $f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} C_{t-1} + b_f)$ 
        Evaluate the output gate: $o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} C_{t-1} + b_o)$ 
        for v = 1 to $S_j$ do 
            $C_t = \tanh(W_c x_t + r_t (W_c h_{t-1} + b_{cn}) + b_n)$ 
            Finally, update the hidden state: 
            $h_t = (1 - o_t) * C_t + o_t * C_{t-1}$ 
        end for 
Step 6: Return the EDN layers. 


Each convolution operation is carried out by sliding the kernel across a
section of the input vector one sample at a time, multiplying and then
adding the values of the superimposed matrices (Shu et al., 2018). To
capture the relevant spatial information in the provided data, the
weights of kernel k must be updated regularly by the network during the
training stage. As a result, the proposed work employs full convolution
rather than another convolution mode. 

This structure was chosen because the full convolution operates on
shorter segments that are already padded with zeros. Furthermore, no bias
is added during the convolution in order to preserve the integrity of the
zero padding. Because it is derived from the convolutional layer, the
output of 10 is also taken to be a zero-padded sequence. After each
convolutional layer, a max-pooling filter of size 2 with non-overlapping
stride is applied to the feature maps, cutting the size of the
representation in half. 
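
The effect of full convolution without bias, followed by size-2 max pooling, can be sketched as follows. The sketch assumes that "full" means padding the input with `len(kernel) - 1` zeros on each side and, as is usual in deep learning software, slides the kernel without flipping it; both function names are hypothetical.

```python
def full_convolution(x, kernel):
    """Zero-pad x on both sides, then slide the kernel over it (no bias term)."""
    k = len(kernel)
    pad = [0.0] * (k - 1)
    padded = pad + list(x) + pad
    return [sum(padded[n + m] * kernel[m] for m in range(k))
            for n in range(len(padded) - k + 1)]

def max_pool(x, size=2):
    """Non-overlapping max pooling: the stride equals the window size."""
    return [max(x[n:n + size]) for n in range(0, len(x) - size + 1, size)]

features = full_convolution([1.0, 2.0, 3.0], [1.0, 1.0])
print(features)            # length grows from 3 to 3 + 2 - 1 = 4
print(max_pool(features))  # pooling with size 2 halves the length again
# Without a bias, an all-zero (padded) segment stays all zero.
print(full_convolution([0.0, 0.0, 0.0], [2.0, 3.0]))
```

The last line illustrates why the bias is omitted: adding a constant would turn the zero-padded part of a short segment into non-zero values.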

Next, an LSTM layer is applied to extract temporal information from these
feature maps. The features extracted by convolution and pooling are
divided into sequential components and fed into the LSTM units for
temporal analysis. However, only the output from the last step of the
LSTM, as stated above, is used in the fully connected layer that predicts
myocardial infarction. 

Compared with conventional approaches, the proposed EDN model has
delivered good results. By removing the recurrent connection network and
dense connections from the network, a high classification accuracy of
87.80% can be achieved. The CNN is better at obtaining spatial features,
whereas the LSTM is effective and efficient at learning temporal
features. 

By combining these two methods, the system was able to improve not only
diagnostic accuracy but also the capability of the model to classify
cardiac signals of various sequence lengths. Because the deep learning
network is trained from start to finish on noisy ECG signals,
pre-processing steps such as noise reduction and feature extraction are
not necessary. 

The suggested system can classify ECG signals of varying lengths,
assuming that the test segment contains just one type of myocardial
infarction. However, this assumption does not always hold, because a
real-world ECG signal may contain several distinct forms of myocardial
infarction. In the future, this research might be extended by applying an
auto-encoder network to these ECG data to perform element-wise analysis,
assigning each sample a class label. 

This would make it possible to parallelize beat detection and
classification. High-end graphics cards are necessary to accelerate the
model’s training process. Instead of using weighted loss for training,
data augmentation [21] can be utilized to increase training data
variability and alleviate the class imbalance problem. 


7.6 Experimental Results 


The PTB database was employed in this study, with 80% of the data used
for training and the remaining 20% used for testing, as listed in
Table 7.2. From the PTB dataset, 148 MI patient records and 52 normal
patient records were used, with 118 MI recordings and 41 normal
recordings used for training and 30 MI recordings and 11 normal
recordings used for testing. 

Table 7.2 PTB dataset used for training and testing purpose 

ECG recordings in PTB dataset 
            MI     Normal 
Total       148    52 
Training    118    41 
Testing     30     11 


The three deep learning models, CNN, LSTM, and EDN, were implemented in
Python using Google Colaboratory (Colab), a free online tool. An ECG
signal classification system was created, trained, and tested to classify
the input ECG signals as healthy or MI, and the experimental findings are
presented here. 

Table 7.3 shows the confusion matrices of the ECG classification system
on the test dataset. The experiments proceeded in a series of steps,
beginning with the raw signal and ending with the final stage, which
combined filtering, feature extraction, and feature selection; at each
step the CNN, LSTM, and EDN classifiers were applied. Precision,
sensitivity, specificity, and accuracy were computed from each
classifier's confusion matrix to assess the performance of the ECG
classification system. 

A deep learning system consists of interconnected neurons that send and
receive signals between each other. The connections between neurons carry
weights, which represent the state of the network and are adjusted during
the learning process. A feed-forward neural network with 10 hidden layers was
used for the classification of myocardial infarction in this research work.
The network was implemented in a Python notebook in Google Colaboratory, with
the number of neurons in each hidden layer limited to 50, and it was trained
on a GPU-based system. The rectified linear unit (ReLU) activation was
applied in the hidden layers and a sigmoid function at the output layer. The
network weights were learned by back-propagation with stochastic gradient
descent. The learning rate was tuned by grid search to improve accuracy and
to reduce overfitting.
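
The architecture described above can be illustrated with a short forward
pass. The following is a minimal sketch in plain Python, not the authors'
implementation: the 10 hidden layers, 50 neurons per layer, ReLU, and sigmoid
output come from the text, while the input size of 20 features, the weight
initialization, and all function names are assumptions made for
illustration. Training (back-propagation and the grid search over learning
rates) is omitted.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer."""
    scale = math.sqrt(2.0 / n_in)  # He-style scaling: an assumption, not from the text
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Affine map: one output value per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Map negative values to zero and keep positive values."""
    return [max(0.0, v) for v in values]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def build_network(n_features, hidden_layers=10, hidden_units=50, seed=0):
    """10 hidden layers of 50 neurons each, followed by a single output neuron."""
    rng = random.Random(seed)
    sizes = [n_features] + [hidden_units] * hidden_layers + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict_proba(network, x):
    """Forward pass: ReLU on every hidden layer, sigmoid on the output."""
    for weights, biases in network[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = network[-1]
    return sigmoid(dense(x, weights, biases)[0])


network = build_network(n_features=20)      # 20 input features is a placeholder
p_mi = predict_proba(network, [0.1] * 20)   # score in (0, 1) for the MI class
```

With untrained weights the output is meaningless as a diagnosis; the sketch
only shows how the layer sizes and activations fit together.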


7.6.1 Performance Evaluation 


The performance of the ECG signal classification system depends on several
important factors: the dataset used for the experiments, the filtering
process used for noise removal, the identification and selection of the
important features present in the data, and the choice of a classification
method that suits the data and gives better results.

Table 7.3 Confusion matrix for the ECG classification system

                                                                  Actual class
Method                                         Predicted class    Abnormal   Normal
Raw signal + CNN                               Abnormal           23         6
                                               Normal             7          5
Raw signal + LSTM                              Abnormal           23         7
                                               Normal             7          4
Raw signal + EDN                               Abnormal           24         7
                                               Normal             5          5
Filtered signal + CNN                          Abnormal           25         5
                                               Normal             4          6
Filtered signal + LSTM                         Abnormal           24         4
                                               Normal             5          7
Filtered signal + EDN                          Abnormal           26         5
                                               Normal             4          6
Filtered signal with selected features + CNN   Abnormal           28         3
                                               Normal             4          6
Filtered signal with selected features + LSTM  Abnormal           27         3
                                               Normal             3          8
Filtered signal with selected features + EDN   Abnormal           29         4
                                               Normal             1          7


7.6.2 Evaluation Metrics 


Evaluation metrics were used in the experiments to measure the performance
of the ECG classification approaches, and the ECG classification system is
evaluated on the results of these experiments. Precision, recall, F1 measure,
Cohen kappa coefficient, and accuracy were the metrics used to measure the
system's performance in this research work.

The confusion matrix is used to measure the performance of a classification
algorithm. It contains the following four values, obtained by comparing the
actual data with the predicted data.

• True negative (TN) is the number of patients without the disease who
show a negative test result with the assay in question.

• True positive (TP) is the number of patients with the disease who
show a positive test result with the assay in question.

• False negative (FN) is the number of patients with the disease who
show a negative test result with the assay in question.

• False positive (FP) is the number of patients not affected by the disease
who show a positive test result with the assay in question.


Precision is the proportion of patients with a positive test result who
actually have the disease. It is given in eqn (7.15).

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (7.15)$$

Recall is the proportion of patients affected by the disease who get a
positive test result with the assay in question. It is given in eqn (7.16).

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (7.16)$$

F1 measure is the harmonic mean of precision and recall. It is given in eqn
(7.17).

$$\text{F1 measure} = \frac{2\,TP}{2\,TP + FP + FN} \qquad (7.17)$$

Cohen kappa coefficient (CKC) measures the agreement between the actual and
the predicted classes, corrected for the agreement expected by chance. It is
given in eqn (7.18).

$$\text{CKC} = \frac{P_o - P_e}{1 - P_e} \qquad (7.18)$$

where $P_o$ is the observed agreement between the actual class and the
predicted class and $P_e$ is the agreement expected by chance.

Accuracy is the ratio of the number of correctly classified data, both normal
and abnormal, to the total number of classified results. It describes how
often the prediction matches the correct value. It is given in eqn (7.19).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7.19)$$
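
The metrics of eqns (7.15) to (7.19) can be computed directly from the four
confusion-matrix counts. The sketch below is illustrative only: the function
name is hypothetical, the abnormal (MI) class is assumed to be the positive
class, and the counts are those listed for "filtered signal with selected
features + CNN" in the confusion matrix table (TP = 28, FP = 3, FN = 4,
TN = 6).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, accuracy and Cohen's kappa from counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / total  # observed agreement P_o

    # Chance agreement P_e: probability that prediction and ground truth
    # agree if both were drawn independently from their marginal rates.
    p_both_positive = ((tp + fp) / total) * ((tp + fn) / total)
    p_both_negative = ((fn + tn) / total) * ((fp + tn) / total)
    p_e = p_both_positive + p_both_negative
    kappa = (accuracy - p_e) / (1 - p_e)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "kappa": kappa,
    }


# Counts for "filtered signal with selected features + CNN";
# treating "abnormal" as the positive class is an assumption.
cnn = classification_metrics(tp=28, fp=3, fn=4, tn=6)
for name, value in cnn.items():
    print(f"{name}: {100 * value:.2f}%")
```

For these counts, precision, recall, F1 measure, and accuracy round to
90.32%, 87.50%, 88.89%, and 82.93%, which agree with the CNN values reported
in the comparison tables of this chapter.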


7.7 Results and Discussion 


The three deep learning models, CNN, LSTM, and EDN, were implemented in
Python using Google Colaboratory (Colab), a free online tool. The ECG signal
classification system was built, trained, and tested to separate the healthy
signal from the MI signal in the input ECG signals, and the experimental
results are discussed here.

Table 7.4 compares the ECG signal classification methods in terms of
precision. The results in Table 7.4 show that the filtered signal with
selected features combined with EDN achieves a higher precision than the
corresponding combinations with the CNN and LSTM algorithms.

Figure 7.12 presents the comparison of the ECG classification methods using
CNN, LSTM, and EDN graphically, beginning with raw ECG signals and
progressing to pre-processed signals with selected features. In terms of
precision, the filtered signal with selected features combined with EDN
yields 96.67%, compared with 90.32% and 90% for CNN and LSTM, respectively.

Table 7.5 compares the ECG signal classification models in terms of
accuracy. The filtered signal with selected features combined with EDN again
achieves the highest accuracy when compared with the corresponding
combinations with the CNN and LSTM algorithms.

Table 7.4 Precision-based comparison of ECG classification methods

Methods                                         Precision (%)
Raw signal + CNN                                79.31
Raw signal + LSTM                               76.67
Raw signal + EDN                                77.42
Filtered signal + CNN                           83.33
Filtered signal + LSTM                          85.71
Filtered signal + EDN                           83.87
Filtered signal with selected features + CNN    90.32
Filtered signal with selected features + LSTM   90.00
Filtered signal with selected features + EDN    96.67

Figure 7.12 Precision-based comparison of ECG classification methods.

Table 7.5 Accuracy-based comparison of ECG classification methods

Methods                                         Accuracy (%)
Raw signal + CNN                                68.29
Raw signal + LSTM                               65.85
Raw signal + EDN                                70.73
Filtered signal + CNN                           77.50
Filtered signal + LSTM                          77.50
Filtered signal + EDN                           78.05
Filtered signal with selected features + CNN    82.93
Filtered signal with selected features + LSTM   85.37
Filtered signal with selected features + EDN    87.80

Figure 7.13 Accuracy-based comparison of ECG classification methods.

Table 7.6 Techniques for ECG classification comparison

Method        CNN (%)   LSTM (%)   EDN (%)
Precision     90.32     90.00      96.67
Recall        87.50     90.00      96.67
F1 measure    88.89     90.00      92.06
Accuracy      82.93     85.37      87.80

Figure 7.13 shows a graphical representation of the classification methods
used to classify ECG signals: pre-processed signal with selected features +
CNN, pre-processed signal with selected features + LSTM, and pre-processed
signal with selected features + EDN.

In terms of accuracy, pre-processed signals with selected features + EDN
score 87.80%, whereas CNN and LSTM score 82.93% and 85.37%, respectively.
The comparison of ECG classification techniques is shown in Table 7.6.

As shown in Table 7.6, the three classifiers applied to the pre-processed
signal with selected features (CNN, LSTM, and EDN) were compared in terms of
precision, recall, F1 measure, and accuracy. EDN outperforms CNN and LSTM on
all criteria (Fu et al., 2020).

Finally, the proposed algorithm is demonstrated in execution (Fu et al.,
2020). The EDN model's loss and accuracy for training and testing data are
shown in Figure 7.14. The ROC curve represents the performance of the
classification algorithm; Figure 7.15 compares the models using the ROC
curve.

Figure 7.14 Loss and accuracy for training and testing data for EDN algorithm.

Figure 7.15 Comparison of models using ROC curve.


7.8 Conclusion 


This chapter outlines the classification strategies used to categorize ECG
data and explains why the recommended approach for classifying the ECG
signal is necessary. Convolutional neural network, LSTM, and EDN algorithms
were used in the classification procedure. In terms of accuracy,
pre-processed signals with selected features + EDN obtain 87.80%, whereas
CNN and LSTM achieve 82.93% and 85.37%, respectively. Overall, the
performance metrics show that the suggested EDN algorithm outperforms deep
learning techniques such as CNN and LSTM in these experiments. According to
the confusion matrices of the algorithms, the proposed EDN model has the
highest accuracy, 87.80%, on the PTB database. The recommended model also
shows improved performance on metrics such as precision, recall, F1 measure,
and Cohen kappa coefficient. These improvements in EDN's performance could
help to save human lives.


Abstract


Image annotation is a difficult task in deep learning. Without annotation,
it is hard for machines to perceive objects. This chapter's main goal is to
develop an automatic annotation system that combines a pre-trained semantic
segmentation model with the MATLAB Image Labeler tool. In this chapter,
automatic annotation of synthetic aperture radar (SAR) images of oil spills
is carried out using pixel-wise semantic segmentation, which is perhaps the
most mainstream task in computer vision. At present, deep-learning-based
convolutional neural networks are driving significant advances in semantic
segmentation owing to their powerful feature representation. The proposed
method uses a pre-trained DeepLabV3+ with ResNet18 as the backbone model to
create an automation algorithm. DeepLabV3+ assigns a category to each pixel
in an input image. Image Labeler is used to create an automation algorithm
for automatic labeling of oil spill images. The novelty of this chapter lies
in adapting a pre-trained DeepLabV3+ with a ResNet18 backbone for image
annotation through the Image Labeler tool to improve the generalization
ability of the system. Extensive experiments on the proposed semantic
segmentation with ResNet18 as the backbone are performed over an oil spill
SAR dataset, and the results are more accurate than those of the Xception
and MobileNet backbone models.


Keywords: Synthetic aperture radar, deep learning, semantic segmentation,
ResNet18, Xception, MobileNet, computer vision.


8.1 Introduction 


Image annotation is the complex task of detecting and classifying every
object in a dataset. Although the process is essential in many cases, the
effort of manual annotation limits the use of annotation and object
detection. A new automation algorithm is therefore developed to perform
object detection and classification effectively. Image annotation helps
machines to recognize varied objects through computer vision and to
interpret them as humans do. It labels the data using the annotation
algorithm to create a large training dataset for machine learning. Without
annotation, it is difficult for machines to recognize objects. Automatic
image annotation (AIA) is an image retrieval technique in which images are
annotated with semantic keywords automatically, which makes the images
easier to retrieve. Image annotation is the process of labeling images with
a set of predefined descriptions based on image attributes. This helps to
bridge the gap between low-level visual features and the high-level meanings
derived from the image. The core idea behind image annotation is to learn
semantic concepts from a large number of sample images and apply them to new
images automatically. Once the images have been labeled, they can be found
quickly using keywords. Automatic image annotation is a difficult process
because of diverse imaging conditions, complex objects that are hard to
describe, highly textured backgrounds, and occlusions [23].

Problems and challenges of image annotation are listed below.


• Manual annotation has traditionally been employed for databases with
vast quantities of images. Manual annotation is, however, an exceptionally
tedious and expensive process for large datasets.

• Segmenting oil spills from an image that contains different classes is a
crucial task. Usually, the classes and their labels differ from one image
to another.

• Low-resolution images from the Internet are very difficult to process
in an automation algorithm. The semantic gap arises when only low-level
features are obtained for image annotation.

• Annotation of new images is possible only after the model has been
trained. Object recognition for semantic prediction is a challenging
task [3].


Pixel-based semantic segmentation classifies objects pixel by pixel, which
lets computer vision localize objects in an image and predict them more
precisely. It groups objects that appear in different regions under the same
class and uses them in training the model. It labels every pixel of an image
to understand the features in the image and turns the computer vision
problem into deep-learning-based segmentation. It improves the precision of
locating multiple objects using computer vision. Satellite images, such as
those used for monitoring oil spills, are processed with a semantic
segmentation model to gather information.

The deep learning method employs a convolutional neural network, which
builds on machine learning by stacking layers to increase the depth and
width of the architecture. The performance of a pre-trained network is
determined by how well it extracts discriminative features, or image
representations, from the input data. Machine learning algorithms perform
better feature extraction than traditional models. The many hidden layers in
neural networks [18] are capable of deriving high levels of abstraction from
raw data. Convolutional neural networks (CNNs) [21] are used to learn image
representations and can be applied to a variety of computer vision
problems. Deep CNNs, in particular, are made up of numerous layers of linear
and non-linear operations that are all learned jointly. The parameters of
these layers are learned over numerous iterations to solve a specific task.
In recent years, CNN-based approaches for feature extraction from image and
video data have gained popularity. A CNN is made up of alternating
convolutional and pooling layers. Convolutional layers consist of stacks of
filters of predefined size that are convolved with the layer's input. The
depth of the CNN can be increased by making the pooling layer's output the
next convolutional layer's input. The convolution layer effectively learns
visual features. For semantic segmentation, the network takes an input image
and produces an output with the same spatial dimensions, in which each pixel
is assigned a class. Convolution, activation (ReLU), and pooling are the
most common layers.


• Convolution processes the incoming images through a series of
convolutional filters, each of which activates different features of the
images.

• By mapping negative values to zero and preserving positive values, the
rectified linear unit (ReLU) enables faster and more effective training.
Because only the activated features are carried on into the next
layer, this is frequently referred to as activation.

• Pooling reduces the number of parameters that the network must learn
by performing non-linear downsampling on the output.
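
The three layer types listed above can be traced on a tiny example. The
following plain-Python sketch is an illustration only: the 5 × 5 input, the
2 × 2 horizontal-gradient filter, and the function names are assumptions
chosen so that the intermediate values are easy to follow, and the filter is
applied as a sliding dot product without flipping, as is common in CNN
libraries.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu2d(feature_map):
    """Map negative responses to zero and keep positive responses."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# A bright vertical stripe (columns 1 and 2) on a dark background.
image = [[0, 1, 1, 0, 0] for _ in range(5)]
# Hypothetical filter responding to dark-to-bright transitions, left to right.
kernel = [[-1, 1],
          [-1, 1]]

features = conv2d_valid(image, kernel)   # 4 x 4, each row is [2, 0, -2, 0]
activated = relu2d(features)             # negative response at the right edge removed
pooled = max_pool2d(activated)           # 2 x 2 summary: [[2, 0], [2, 0]]
```

The filter responds positively at the left edge of the stripe and negatively
at the right edge; ReLU keeps only the positive response, and pooling halves
each spatial dimension while retaining it.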


The performance of CNNs has improved with smaller filter sizes (3 × 3) and
deeper architectures. ResNet18 [19] is one such example (shown in
Figure 8.3).


8.2 Related Work 
8.2.1 Image Annotation Algorithm 


Image annotation using CCA-KNN is a new model based on combining CNN
features with word embedding vectors. The main goal of this technique is to
avoid computing multiple features, and it is also useful in many real-world
applications. The findings are presented for all three versions of the CCA
design: linear, kernel, and k-nearest neighbor clustering. CCA-KNN
outperforms previous results and achieves comparable results on all the
datasets [24]. Image annotation at the pixel level is carried out through a
guided filter network (GFN). GFN helps in generating labels and optimizing
them iteratively to label the final image. Compared with conventional weakly
supervised segmentation methods, it performs semantic segmentation well
[26]. From a multi-label dataset, selected labels are used based on a
ranking function. The annotation algorithm fuses CNN features with a VGG16
backbone network along with optimal thresholding. This thresholding avoids
downsampling of the images and predicts the correct label masks. The
advantage of this technique is the improved parameters; the disadvantage is
that, because the CNN is shallow, higher-level features cannot be predicted
accurately. It may be improved by deepening the layers [9]. Convex deep
learning models such as the tensor deep stacking network and the kernel deep
convex network are used to annotate images and, in particular, make use of
CNN features. They take less time to train [14].

Faster RCNN with a pre-trained VGG-16 model and RFCN with ResNet101 are
fine-tuned to classify objects as either foreground or background. The
proposed automated annotation method was found to be very efficient in
detecting unknown objects, even in an unknown environment. The robustness of
the model to new objects was confirmed by foreground detection results on
completely new object sets. The model is also shown to be robust to images
of any camera resolution and to different lighting conditions. The proposed
annotation method generates a rectangular ROI around every object but is not
able to generate a segmented object region using the given architecture. To
get the precise contour of an object, this work may be extended by applying
pixel-wise semantic segmentation techniques, such as Mask RCNN or PSPNet, in
place of Faster RCNN/RFCN [21]. In fully supervised annotation methods, the
main challenges arise from the difficulty of characterizing the complex and
ambiguous contents of satellite images and from the high human labor cost of
preparing a large number of training examples with high-quality pixel-level
labels. This problem is overcome in a weakly supervised fashion with an
efficient high-level semantic feature transferring scheme. The proposed
method made full use of an auxiliary satellite image dataset to learn
high-level features based on the SDSAE deep learning technique and then
transferred the learned features to perform annotation with only a small
amount of training data, which substantially reduced the cost of labeling
training data. Evaluations have shown that the technique can offer
competitive performance in comparison with fully supervised methods [27].
Automatic image annotation with a deep CNN can be formulated as a
multi-label learning problem. When training this model, the generated images
can help to reduce overfitting and improve the generalization capability of
the CNN model [10].

Superpixel segmentation, a novel algorithm, is combined with a hierarchical
Dirichlet process to analyze objects in images that are represented as a bag
of words. Superpixels that regularly appear in the same cluster are stitched
together to create synthetic composite images, and the associated labeling
is examined using a reverse image search [16]. A combined prediction,
post-processing, and adjustment model can be used for annotating images or
videos. It targets aerial images, which have far fewer salient features for
detection. In addition, by leveraging user click data and the adjustment
model, the overall IoU can be improved and the framework can be extended at
runtime to adapt to new classes whose labeled training data are not readily
available [19].

To address the issues posed by diverse objects under varied lighting
conditions, a labeling tool created for image segmentation proves more
efficient. Superpixels are the fastest option for objects that are fairly
large and contrast strongly with their surroundings. Polygons are useful for
large or low-contrast objects, while the brush is useful for small details.
The ground truth labels have been shared as mask files in a publicly
available dataset for other researchers to use. The image dataset, as well
as the image labeling tool, will be upgraded in the future [20].


8.2.2 Semantic Segmentation 


For segmentation, semantic segmentation models such as FCN, U-Net, and
DeepLabV3 were utilized. Even though the dataset contains images taken with
different camera calibrations, the pre-trained deep-learning-based
segmentation models give accurate results. Each model's computing
performance on the CPU and GPU is measured. DeepLabV3 and U-Net are a little
slower than the FCN model in terms of inference time. Deep-learning-based
segmentation models outperform standard segmentation approaches by a
significant margin [8]. The DeepLabV3+ semantic segmentation framework
combined with the Xception model produces insufficient information at local
edges [7].

All existing features currently share the problem of not describing images
sufficiently. The goal of the AIA technique is to bridge the gap between the
low-level visual image features captured by machines and the high-level
semantic concepts perceived by humans. Another difficulty is the
simultaneous execution of all annotations. When confronted with large
training datasets, several image annotation models require a long time and
high computational complexity in the training phase, making them
computationally expensive [1]. The CMGGAN network is utilized to develop a
new technique for semantic segmentation of range sensor data. The suggested
technique completely avoids the expensive and difficult labeling of range
sensor data because the semantic segmentation networks are pre-trained on
the datasets. Although the experimental results look promising, it is
critical to train the model on a much larger dataset, which will most likely
include more dynamic object detections [12]. On SAR images, two different
implementations of the U-Net architecture are investigated: one in which
U-Net is trained from scratch and another that starts from pre-trained
weights (transfer learning). U-Net is capable of recognizing minute details
inside an image, such as small rivers and other features [18]. Pixel-level
semantic segmentation is improved by combining global context information
with local image features. First, in the encoder network, a fusion layer is
established that merges global and local features. In the decoder network, a
stacked pooling block is used to extend the receptive fields of feature
maps, which is needed to contextualize local semantic predictions. This
strategy is based on ResNet18, which decreases the number of parameters in
the model and allows it to predict better than earlier models [22].


8.3 Proposed Method 


Image annotation is of major importance for retrieving and masking images
with various classes. This chapter covers the annotation of images using
semantic segmentation. The proposed method has two stages. First, semantic
segmentation is performed using a pre-trained DeepLabV3+ model; then image
labeling is done with MATLAB Image Labeler. The proposed work is carried out
using MATLAB Image Labeler and several SAR datasets, namely ENVISAT,
ALOSPOLSAR, TERRASAR, and SENTINEL.

The MATLAB Computer Vision Toolbox is considered more proficient in image
handling than OpenCV. The aim of the proposed algorithm is to extract
classes from an image and label them with a suitable class using DeepLabV3+
(a pre-trained convolutional neural network). Figure 8.2 depicts an
architectural overview of the proposed framework. The basic steps are image
acquisition, pre-processing, semantic segmentation, and the automation
algorithm, after which the annotated image is generated.

The proposed work comprises the following modules.

• Image pre-processing
• Semantic segmentation
• Annotation algorithm


8.3.1 Image Pre-Processing 


Pre-processing is used to enhance images, which improves the precision of
the results. MATLAB has image pre-processing tools that aid in feature
analysis and noise reduction. In this proposed work, the following
pre-processing steps are necessary for getting error-free results in the
semantic segmentation process.


1. Data Acquisition 


SAR oil spill datasets are collected from different satellites such as
ENVISAT, SENTINEL, ALOSPOLSAR, and TERRASAR. The characteristics of the
satellites are shown in Table 8.1. The oil spill images are obtained from
the Gulf of Mexico (2010), the Mediterranean shoreline of Israel (2021), and
the MV Wakashio oil spill off Mauritius (2020). All pixels within each image
were categorized into two classes, oil and background, as shown in
Figure 8.1. The labels were used to create ground truth data for training
and validation of the semantic segmentation algorithm. The Image Labeler
tool in MATLAB was used to carry out the labeling. Labeling can be done by
hand, semi-automatically, or automatically with the use of an automation
algorithm. Automated labeling is used in this study to ensure that the
region of interest is appropriately delineated.

Figure 8.1 Ground truth image.

Table 8.1 Characteristics of SAR satellites

Satellite     Spectral   Spatial      Temporal     Swath   No. of
sensors       bands      resolution   resolution   width   datasets
TERRASAR      X          18.5         11           260     2
ENVISAT       C          30           3            400     12
ALOS POLSAR   L          6            14           350     13
SENTINEL      SWIR       10           5            290     13


2. Image Resizing


The input image size must match the input size expected by the semantic
segmentation network activations for all datasets, to avoid the DAGNetwork
error.


3. Data Augmentation


As the training dataset contains only 40 images, overfitting is a serious
concern. If we train for several epochs on this small dataset, the model may
begin fitting the noise in it, leading to poor performance on out-of-sample
examples. This problem can be mitigated to some extent by data augmentation,
which also helps to increase network accuracy. DeepLabV3+ is trained using
augmented image data. Data augmentation prevents the network from
overfitting and memorizing the exact details of the training images. The
sample data, which include oil spills, are then loaded.
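
For semantic segmentation, any geometric augmentation must be applied
identically to an image and to its label mask, so that every pixel keeps its
label. The sketch below illustrates this with random flips in plain Python.
The chapter does not state which augmentations were used, so the choice of
flips, the function names, and the toy data are all assumptions; the
chapter's own pipeline is built in MATLAB.

```python
import random


def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Mirror a 2D grid top to bottom."""
    return [row[:] for row in grid[::-1]]


def augment_pair(image, mask, rng):
    """Apply the same random flips to an image and its label mask."""
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    return image, mask


# Toy 4 x 6 example: mask value 1 marks oil, 0 marks background (assumed ids).
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Hypothetical intensities: dark (20) where there is oil, brighter (200) elsewhere.
image = [[20 if label == 1 else 200 for label in row] for row in mask]

rng = random.Random(0)
augmented = [augment_pair(image, mask, rng) for _ in range(8)]
```

Each augmented pair has the same size as the original, and the dark pixels of
the augmented image still coincide with the oil labels of the augmented mask.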


4. Log Normalization 


Log normalization is a pre-processing step for the machine learning process
in MATLAB and a method for standardizing the data. It applies the log
transformation to the pixel values in the image, which helps to enhance dark
features: the range of dark areas is increased while clipping of bright
areas is avoided.


$$x' = \log(x + 1)$$


The pixel values of the output and input images are x' and x, respectively.
To avoid taking the logarithm of a 0 pixel value, 1 is added to each pixel
value, so that the argument of the logarithm is always at least one.


Properties of the log transformation:


• The range of gray levels is expanded for input images with lower
amplitudes.

• The range of gray levels is compressed for greater amplitudes of the
input image.
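
The two properties can be checked numerically. The sketch below applies
x' = log(x + 1) to a few 8-bit pixel values in plain Python. Rescaling the
result back to the 0 to 255 range is an assumption added here for
readability and is not stated in the text; the function name is
hypothetical.

```python
import math


def log_normalize(image, max_value=255):
    """Apply x' = log(x + 1), then rescale so that max_value maps to itself."""
    scale = max_value / math.log1p(max_value)  # rescaling is an assumption
    return [[scale * math.log1p(pixel) for pixel in row] for row in image]


# One row holding a dark range (0 to 50) and a bright range (200 to 255).
row = log_normalize([[0, 50, 200, 255]])[0]

dark_span = row[1] - row[0]     # width of the dark range after the transform
bright_span = row[3] - row[2]   # width of the bright range after the transform
print(f"dark 0-50 -> span {dark_span:.1f}, bright 200-255 -> span {bright_span:.1f}")
```

The dark range, 50 gray levels wide at the input, is stretched to well over
50 levels at the output, while the bright range, 55 levels wide, is
compressed to far fewer.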


5. Hybrid Median Filter 


The hybrid median filter (HMF) is an extension of the median filter that
preserves edges better while smoothing the noise in the image. The steps
involved in HMF are as follows.


1. Compute the median of the horizontal and vertical neighbors of the pixel.

2. Compute the median of the diagonal neighbors of the pixel.

3. Take the value of the center pixel.

4. Compute the median of the three values from steps 1 to 3 and replace the
center pixel with this new value.
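
The four steps can be sketched for a 3 × 3 neighborhood as follows. This is
a minimal plain-Python illustration, not the chapter's MATLAB code: the
window size, the inclusion of the center pixel in both neighbor groups (a
common formulation), and the decision to copy border pixels unchanged are
assumptions.

```python
from statistics import median


def hybrid_median_filter(image):
    """3 x 3 hybrid median filter; border pixels are copied unchanged."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            center = image[r][c]
            # Step 1: horizontal and vertical neighbors (plus shape).
            plus = [image[r - 1][c], image[r + 1][c],
                    image[r][c - 1], image[r][c + 1], center]
            # Step 2: diagonal neighbors (X shape).
            cross = [image[r - 1][c - 1], image[r - 1][c + 1],
                     image[r + 1][c - 1], image[r + 1][c + 1], center]
            # Steps 3 and 4: median of the two medians and the center pixel.
            out[r][c] = median([median(plus), median(cross), center])
    return out


# An isolated bright pixel (impulse noise) on a flat background is removed.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
denoised = hybrid_median_filter(noisy)

# A vertical step edge passes through the filter unchanged.
edge = [[0, 0, 100, 100, 100] for _ in range(5)]
edge_filtered = hybrid_median_filter(edge)
```

In the first example the impulse is replaced by the background value; in the
second the step edge is left intact, which is the edge-preserving behavior
the filter is chosen for.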


Figure 8.2 depicts the overall architecture of the suggested method and
shows the process of automatic annotation of the dataset. The dataset comes
from the SAR satellites listed in Table 8.1. Pre-processing is applied to
the dataset. DeepLabV3+ semantic segmentation with ResNet18 as the backbone
is trained and saved in MATLAB. The pre-trained model is then loaded into a
custom automation algorithm to annotate the images automatically.

Figure 8.2 An architectural overview of the proposed framework.


8.3.2 Semantic Segmentation 


Semantic segmentation moves from coarse to fine inference by making dense
predictions, inferring a label for each pixel, so that every pixel is marked
with its class. The process of semantic segmentation includes the following
steps.


1. Label Data or Obtain Labeled Data 


For semantic segmentation, a pixel-wise label, or ground truth label, is
applied: every pixel in a training image has a corresponding label pixel.
For example, for an image of dimensions 512 × 512 × 3, the label data have
dimensions 512 × 512, with one label per pixel. In this work, the image
contains two classes, oil and background, with class labels 0 and 1. The
label image is therefore 512 × 512 and contains the label 0 or 1 for each
pixel.
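
The relationship between image and label dimensions can be made concrete
with a small sketch. The code below builds a blank 512 × 512 × 3 image and a
matching 512 × 512 label mask in plain Python and counts the pixels per
class. Which of the two ids denotes oil, the rectangular oil region, and the
helper name are assumptions for illustration only.

```python
HEIGHT, WIDTH, CHANNELS = 512, 512, 3
BACKGROUND, OIL = 0, 1  # assignment of ids to classes is an assumption

# Image: one row per scan line, three interleaved channel bytes per pixel.
image = [bytearray(WIDTH * CHANNELS) for _ in range(HEIGHT)]

# Label mask: exactly one class id per pixel, no channel dimension.
labels = [bytearray(WIDTH) for _ in range(HEIGHT)]

# Mark a hypothetical rectangular oil slick (100 rows by 150 columns).
for r in range(100, 200):
    for c in range(150, 300):
        labels[r][c] = OIL


def class_pixel_counts(label_mask, num_classes=2):
    """Count how many pixels carry each class id."""
    counts = [0] * num_classes
    for row in label_mask:
        for class_id in row:
            counts[class_id] += 1
    return counts


counts = class_pixel_counts(labels)
print(f"background: {counts[BACKGROUND]}, oil: {counts[OIL]}")
```

Counting pixels per class is also a quick way to see how unbalanced the two
classes are, since oil usually covers a small fraction of a scene.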


2. Create a Datastore for Original Images and Labeled Images 


A datastore is created for the original input images and another for the
ground truth data. Once the images are labeled using the Image Labeler tool,
the labeled images are stored as pixel label data. Both datastores are then
used for training or for the data augmentation process.


3. Create a Semantic Segmentation Pre-Trained Network 


An encoder and a decoder form the foundation of an image segmentation
architecture. The encoder uses filters to extract features from the image.
The decoder is responsible for generating the final output, which is often a
segmentation mask representing the shape of the object. This architecture or
a variant of it can be found in almost all segmentation networks. The DeepLab
architecture is used in this study. It employs convolutions with upsampled
filters (atrous convolutions) for tasks that need dense prediction. To
segment objects at various scales, atrous spatial pyramid pooling (ASPP) is
utilized. A CNN is employed to improve object boundary localization. Atrous
convolution is produced by upsampling the filters through the insertion of
zeros or, equivalently, by sparsely sampling the input feature maps.
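
The two views of atrous convolution mentioned above, zero insertion into the
filter and sparse sampling of the input, can be checked against each other
with a small one-dimensional sketch. This is plain Python written for
illustration only; the function names are hypothetical and the code is not
taken from DeepLab or from MATLAB.

```python
def dilate_filter(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring filter taps."""
    dilated = []
    for i, weight in enumerate(kernel):
        dilated.append(weight)
        if i < len(kernel) - 1:
            dilated.extend([0] * (rate - 1))
    return dilated

def conv1d_valid(signal, kernel):
    """Plain sliding-window filtering with no padding and stride 1."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def atrous_conv1d(signal, kernel, rate):
    """Atrous filtering: read the input sparsely, every `rate` positions."""
    span = (len(kernel) - 1) * rate + 1  # input positions covered by the filter
    return [sum(signal[i + j * rate] * kernel[j] for j in range(len(kernel)))
            for i in range(len(signal) - span + 1)]

signal = [1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]
rate = 2

via_zero_insertion = conv1d_valid(signal, dilate_filter(kernel, rate))
via_sparse_sampling = atrous_conv1d(signal, kernel, rate)
print(via_zero_insertion)
print(via_sparse_sampling)
```

Both routes give the same output. With a rate of 2, the three filter weights
cover five input positions, so the field of view grows while the number of
learned weights stays at three.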


4. Train and Evaluate the Network 


The input images were color images with a resolution of 512 x 512 pixels for
both the DeepLabV3+ and SegNet networks. Because the network architectures
differ, the setup parameters of each architecture were fine-tuned
differently. For instance, the batch size of each model was adjusted to suit
the single CPU. However, the number of epochs in this study was kept
consistent to allow a fair comparison and to limit training time. To avoid
overfitting, training is terminated once the loss computed on the validation
set has worsened for four consecutive epochs.
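
A minimal sketch of this stopping rule is given below. It assumes that
"worsened" means the validation loss failed to improve on the best value seen
so far, which is one common reading; the chapter does not spell out the exact
criterion. The function name and the loss values are made up for
illustration.

```python
def epoch_of_early_stop(val_losses, patience=4):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss          # new best validation loss: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement in this epoch
            if bad_epochs == patience:
                return epoch
    return None

# Made-up validation losses: the best value is reached in epoch 3.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.63, 0.50]
stop_epoch = epoch_of_early_stop(losses, patience=4)
print(stop_epoch)
```

With these values, epochs 4 to 7 all fail to beat the loss of epoch 3, so
training stops before the later improvement in epoch 8 is ever seen. This is
the usual trade-off of a patience-based rule.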


Training Phase: 


DeepLabV3+ was trained using the stochastic gradient descent with momentum
(SGDM) method with an L2 regularization value of 0.005. The learning rate
followed a piecewise schedule that reduced it by a factor of 0.3 every three
epochs from an initial value of 0.0002. For every 10 epochs, the network
multiplied the starting learning rate of 0.001 by 0.3, and training was
completed over 30 epochs. All the models were trained and validated using
MATLAB R2020a on a system with a single CPU. Table 8.2 summarizes the
hyperparameters used for each network architecture in this study.
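
A piecewise schedule of this kind can be written as a one-line rule. The
sketch below assumes that the drop takes effect at the start of the epoch
following each full period (epochs 4, 7, ... for a period of three); the
function and parameter names are illustrative and are not MATLAB option
names.

```python
def piecewise_lr(epoch, initial_lr, drop_factor, drop_period):
    """Learning rate in effect during a given 1-based epoch."""
    completed_periods = (epoch - 1) // drop_period
    return initial_lr * drop_factor ** completed_periods

# Schedule quoted in the text: start at 0.0002, multiply by 0.3 every three epochs.
for epoch in range(1, 10):
    rate = piecewise_lr(epoch, initial_lr=0.0002, drop_factor=0.3, drop_period=3)
    print(epoch, rate)
```

The same function covers the other schedule quoted above by calling it with
an initial rate of 0.001, a factor of 0.3, and a period of 10 epochs.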


Table 8.2 Hyperparameters for network architecture 


Model                           Sequential parameters
Activation function (input)     ReLU
Activation function (output)    SoftMax
Optimizer                       SGDM
Loss function                   Cross entropy
No. of epochs                   3
Batch size                      5
Validation split                0.10


Inference Phase:


The inference phase was carried out to determine how well the trained
networks performed on new images. To do so, all three models were applied to
several images that had not been used during the training phase, in order to
assess how well each model generalized.


DeepLabV3+: 


Chen et al. [5] created a segmentation model based on deep learning. An
encoder-decoder architecture is commonly used in such models, as shown in
Figure 8.5. The DeepLabV3 network [10] uses atrous convolution, which allows
a convolutional layer to enlarge its receptive field without lowering the
spatial dimension or increasing the number of parameters, thereby
strengthening the network's segmentation performance. In addition, V3
enhances the ASPP module and draws on the hybrid dilated convolution (HDC)
[9] concept, which is used to reduce the "gridding issue" produced by atrous
convolution and to extend the receptive field so that it aggregates global
information while keeping the backbone [4, 22]. DeepLabV3+ uses an
encoder-decoder structure: DeepLabV3 is used to encode the rich contextual
information, and a simple yet effective decoder module is used to refine the
object boundaries. It is also worth noting that atrous convolution can be
used to extract encoder features at an arbitrary resolution, depending on the
available computation resources [11]. The images are segmented using the
following typical deep networks.


1. AlexNet: It is an eight-layered CNN composed of five convolutional
layers, max-pooling layers, ReLU non-linearities, three fully connected
layers, and dropout. Feature extraction is performed in each of the
layers, and the features from the deeper AlexNet layers are more
abstract [4].

2. VGG16: It uses a stack of convolution layers with small receptive
fields, which works better than a single layer with a large receptive
field.

3. ResNet18: The ResNet family is notable for its depth (up to 152 layers;
ResNet18 is the 18-layer variant) and for the introduction of residual
blocks. The residual blocks address the problem of training a very deep
architecture by introducing identity skip connections so that layers can
copy their inputs to the following layers, as shown in Figure 8.3.

4. MobileNetV2: It is a convolutional neural network architecture that is
designed to perform well on mobile devices. It relies on an inverted
residual structure in which the residual connections are between the
bottleneck layers. The intermediate expansion layer uses lightweight
depth-wise convolutions to filter features as a source of non-linearity.
Overall, the MobileNetV2 architecture contains an initial fully
convolutional layer with 32 filters, which is followed by 19 residual
bottleneck layers. Figure 8.5 depicts the MobileNet architecture [13].


Figure 8.3 ResNet18 architecture.


5. Xception: In the Xception architecture, the feature extraction base of
the network is formed by 36 convolutional layers. These 36 layers are
grouped into 14 modules, all of which have linear residual connections
around them, except for the first and last modules. In short, the
Xception architecture is a linear stack of depth-wise separable
convolution layers with residual connections. It is shown in
Figure 8.4 [6].

Figure 8.4 Xception architecture.

Figure 8.5 MobileNet architecture.


In this work, the ResNet18, MobileNetV2, and Xception backbones are trained
and their results are compared. ResNet18, which is the most accurate and
performs well compared to Xception, is used as the backbone of the
pre-trained model for image annotation. The advantages of ResNet18 compared
to the other backbones are as follows.


• DeepLabV3+ models using ResNet18 were optimal in terms of inference time
and accuracy.

• Networks with a large number of layers can be trained easily without
increasing the percentage of training errors.

• ResNet18 uses identity mapping to help solve the vanishing gradient
problem.
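
The identity mapping mentioned in the last point can be illustrated with a
toy residual block in plain Python. This is a simplified sketch, not the
ResNet18 implementation: the residual branch is reduced to a single scalar
weight followed by ReLU, and the function names are hypothetical.

```python
def residual_block(x, weight):
    """y = F(x) + x, with a toy residual branch F(x) = relu(weight * x)."""
    branch = [max(0.0, weight * value) for value in x]
    return [b + value for b, value in zip(branch, x)]

def plain_block(x, weight):
    """The same branch without the identity skip connection."""
    return [max(0.0, weight * value) for value in x]

x = [0.5, -1.0, 2.0]
residual_out, plain_out = x, x
for _ in range(10):  # stack ten blocks whose weights are all zero
    residual_out = residual_block(residual_out, weight=0.0)
    plain_out = plain_block(plain_out, weight=0.0)

print(residual_out)  # the input is passed through unchanged
print(plain_out)     # the signal has been wiped out
```

Because each block adds its input back to the branch output, a block that has
learned nothing still behaves as the identity, so information (and, during
training, the gradient) has a direct path through a deep stack.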


8.3.3 Automation Algorithm 


The Image Labeler supports semantic segmentation by labeling ground truth
data in a collection of images. The Image Labeler tool includes the features
listed below.


• Images can be labeled with different types of region of interest (ROI)
labels, such as rectangular, polyline, pixel, and polygon labels, as well
as with scene labels.

• Built-in or custom algorithms can be used to label the ground truth data.

• Automation algorithms are evaluated using a visual summary.

• The labeled ground truth can be exported as an object that is used for
training semantic segmentation networks.


All image file formats are supported by the Image Labeler app. For very
large images, it employs a blocked image, which is a huge image that has been
broken down into small blocks that fit into memory.


Figure 8.6 Loading image and selecting custom algorithm.


MATLAB Image Labeler is used for creating algorithms. The steps involved in
creating an automation algorithm are given below.


• Open Image Labeler and load the image to be annotated as shown in
Figure 8.6.

• Create ROI labels as used in creating semantic segmentation.

• Click the Select Algorithm dropdown to select an existing algorithm
named Oil Spill Segmentation as shown in Figure 8.6 or create a new
algorithm.

• Click Automate and run the algorithm to annotate images.

• Save the annotated images as shown in Figure 8.7.


Figure 8.7 Annotated image.

Figure 8.9 Flow of algorithm execution.


The methods that are invoked while running are as follows.


• The check label definition method is invoked when Automate is clicked; it
checks the ROI labels against the scene labels. If the pixel label is not
in the scene, the label is not included. If the pixel label is present in
the scene, the label is assigned.

• The check setup method checks the validity of conditions, for example,
that the scene has at least one ROI label.

• The initialize method sets up the state of the custom algorithm by
using the scene.

• The run method executes the algorithm that performs the image
annotation. The annotated image is shown in Figure 8.7.

• The terminate method clears the state of the automation algorithm after
the algorithm runs.
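

The order in which these methods are invoked can be sketched as follows. This
is a hypothetical Python stand-in written only to show the control flow; it
is not the MATLAB Image Labeler API. The class name, the `automate` driver,
the rule that only pixel labels are accepted, and the threshold-based
`segment_fn` are all assumptions made for illustration, with the threshold
function standing in for the trained network.

```python
class OilSpillAutomation:
    """Hypothetical stand-in for a custom automation algorithm."""

    def __init__(self, segment_fn):
        self.segment_fn = segment_fn  # placeholder for the trained network
        self.calls = []               # records the order of method calls

    def check_label_definition(self, label):
        self.calls.append("check_label_definition")
        return label["type"] == "pixel"   # assumed rule: keep pixel labels only

    def check_setup(self, labels):
        self.calls.append("check_setup")
        return len(labels) >= 1           # at least one ROI label is required

    def initialize(self, image):
        self.calls.append("initialize")
        self.height, self.width = len(image), len(image[0])

    def run(self, image):
        self.calls.append("run")
        return self.segment_fn(image)     # produces the annotation mask

    def terminate(self):
        self.calls.append("terminate")


def automate(algorithm, image, labels):
    """Drive the algorithm through its lifecycle and return the mask."""
    valid = [lab for lab in labels if algorithm.check_label_definition(lab)]
    if not algorithm.check_setup(valid):
        raise ValueError("no valid ROI label defined")
    algorithm.initialize(image)
    mask = algorithm.run(image)
    algorithm.terminate()
    return mask


# Toy 2 x 2 grayscale image and a toy threshold in place of the real model.
image = [[10, 200], [30, 220]]
labels = [{"name": "oil", "type": "pixel"}, {"name": "night", "type": "scene"}]
algorithm = OilSpillAutomation(
    segment_fn=lambda img: [[1 if p < 50 else 0 for p in row] for row in img]
)
mask = automate(algorithm, image, labels)
print(mask)
print(algorithm.calls)
```

Keeping the label checks and setup validation separate from the run step
means an invalid labeling session is rejected before any segmentation work is
done.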


8.4 Performance Measures 


In the proposed work, the DeepLabV3+ backbones ResNet18, MobileNet, and
Xception are evaluated and compared on oil spill SAR images. ResNet18 gives
more accurate results than Xception. The comparison of the backbones of the
DeepLabV3+ network is shown in Figure 8.12. Data augmentation improves the
outcome. The DeepLabV3+ backbones are trained using the stochastic gradient
descent with momentum optimizer with a batch size of three image tiles on a
single CPU. A learning rate of 1e-3 is fixed for all backbones. The training
dataset contains 40 images, which are divided into two subsets and trained on
the ResNet18 and Xception pre-trained models for three epochs. Because
ResNet18 is more accurate than Xception, all 40 images are then used to train
ResNet18. The training accuracy and loss of ResNet18 are shown in
Figure 8.15. Some commonly accepted performance measures, such as accuracy,
BF score, and intersection over union, are presented in Table 8.3, together
with the confusion matrix, enabling quantitative assessment of the accuracy
of the semantic segmentation outputs. The performance measures of the
different backbones are shown in Figure 8.11. The confusion matrix of the
predicted labels and ground truth labels is shown in Table 8.5. It shows that
the segmentation correctly classified 990,321 pixels as oil and 8,758,554
pixels as background, and that it misclassified 146,139 background pixels as
oil and 457,267 oil pixels as background. Class-specific accuracy measures
are used to calculate the proportion of correctly identified pixels relative
to the reference (sensitivity) and the fraction of correctly classified
pixels relative to the output. The training progress of all three backbones
of DeepLabV3+ is shown in Figures 8.13-8.15. Comparatively, ResNet18 performs
well, and the annotation algorithm is implemented using this backbone.


8.4.1 Evaluation of Segmentation Models 


1. Pixel Accuracy 


Pixel accuracy is normally reported for each class separately as well as
globally across all classes. The per-class pixel accuracy is evaluated using
a binary mask. This metric can occasionally produce misleading results when
the class occupies only a small part of the image, because the measure is
then dominated by how well the negative cases are detected.


$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP = a pixel that has been correctly identified as belonging to the
specified class; FP = a pixel that has been incorrectly assigned to the
specified class; TN = a pixel that has been correctly identified as not
belonging to the specified class; and FN = a pixel that has been incorrectly
identified as not belonging to the specified class.
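
The way pixel accuracy can mislead for a small class is easy to reproduce.
The sketch below uses a made-up 10 x 10 ground truth with four oil pixels and
a prediction that labels everything as background. Treating 1 as the oil
class is an assumption made for illustration.

```python
def pixel_counts(truth, predicted, positive=1):
    """Count TP, TN, FP, and FN for one class over two same-sized masks."""
    tp = tn = fp = fn = 0
    for truth_row, pred_row in zip(truth, predicted):
        for t, p in zip(truth_row, pred_row):
            if t == positive and p == positive:
                tp += 1
            elif t != positive and p != positive:
                tn += 1
            elif t != positive and p == positive:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn

def pixel_accuracy(truth, predicted, positive=1):
    tp, tn, fp, fn = pixel_counts(truth, predicted, positive)
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up example: 4 oil pixels out of 100, and a prediction that finds none.
truth = [[0] * 10 for _ in range(10)]
for row, col in [(4, 4), (4, 5), (5, 4), (5, 5)]:
    truth[row][col] = 1
all_background = [[0] * 10 for _ in range(10)]

accuracy = pixel_accuracy(truth, all_background)
print(accuracy)
```

The prediction misses every oil pixel, yet its accuracy is 96 out of 100
because the true negatives dominate. This is the reason overlap-based
measures are reported alongside accuracy.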


2. Intersection Over Union 


The intersection over union (IoU) measure, also known as the Jaccard index,
is a way of quantifying the percentage of overlap between the target mask
and the predicted output. This metric is closely related to the Dice
coefficient, which is often used as a loss function in training. The IoU
metric divides the number of pixels common to the target and prediction
masks by the total number of pixels present across both masks. The IoU is
calculated using the formula below.


$$\text{IoU} = \frac{|\text{Target} \cap \text{Prediction}|}{|\text{Target} \cup \text{Prediction}|} = \frac{|A \cap B|}{|A \cup B|}$$


Here A denotes the target (ground truth) mask and B the prediction mask. The
intersection (A ∩ B) is made up of the pixels found in both the prediction
and ground truth masks, whereas the union (A ∪ B) is made up of all pixels
found in either the prediction or the target mask. The IoU is determined for
each class separately and then averaged over all classes of the semantic
segmentation prediction.
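

As a small worked example of the per-class IoU and its average, the following
Python sketch computes the measure on two made-up 3 x 4 masks. It also shows
the relation to the Dice coefficient, D = 2J / (1 + J), where J is the IoU.
The function names are illustrative.

```python
def iou(target, prediction, cls):
    """Intersection over union of one class for two same-sized label masks."""
    intersection = union = 0
    for target_row, pred_row in zip(target, prediction):
        for t, p in zip(target_row, pred_row):
            in_target, in_pred = (t == cls), (p == cls)
            intersection += in_target and in_pred
            union += in_target or in_pred
    return intersection / union if union else float("nan")

def mean_iou(target, prediction, classes):
    """Average of the per-class IoU values."""
    scores = [iou(target, prediction, c) for c in classes]
    return sum(scores) / len(scores)

def dice_from_iou(jaccard):
    """Dice coefficient expressed through the Jaccard index."""
    return 2 * jaccard / (1 + jaccard)

target = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0]]
prediction = [[1, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]]

print(iou(target, prediction, cls=1))          # 3 shared pixels out of 5
print(iou(target, prediction, cls=0))          # 7 shared pixels out of 9
print(mean_iou(target, prediction, classes=[0, 1]))
print(dice_from_iou(iou(target, prediction, cls=1)))
```

For class 1, three pixels are shared and five are covered by either mask, so
the IoU is 3/5, while the Dice coefficient for the same masks is 2·3/(4 + 4).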


3. Mean BF Score 




The BF score is the harmonic mean (F1-measure) of the precision and recall
values, computed with a distance error tolerance that decides whether a
point on the predicted boundary matches the ground truth boundary:


$$\text{BFscore} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$




It shows how well the predicted boundary of each class matches the true
boundary. In comparison to the IoU metric, the BF score correlates better
with human qualitative assessment. This metric is not available when a
confusion matrix is used as the input to the evaluation function.


• The mean BF score of a class is the average BF score of that class over
all images.

• The mean BF score of an image is the average BF score of all classes in
that image.

• The mean BF score of a dataset is the average BF score of all classes in
all images.
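

A simplified sketch of the BF score for a single class in a single image is
shown below. It assumes that the boundaries are already available as lists of
(row, column) points and that a point counts as matched when some point of
the other boundary lies within the tolerance. Real implementations extract
the boundaries from the masks and choose the tolerance relative to the image
size; those steps are omitted here, and the function name is hypothetical.

```python
import math

def bf_score(pred_boundary, true_boundary, tolerance):
    """Boundary F1 score of two point lists under a distance tolerance."""
    def matched(points, reference):
        # Number of points that have a reference point within the tolerance.
        return sum(
            1 for p in points
            if any(math.dist(p, q) <= tolerance for q in reference)
        )

    if not pred_boundary or not true_boundary:
        return 0.0
    precision = matched(pred_boundary, true_boundary) / len(pred_boundary)
    recall = matched(true_boundary, pred_boundary) / len(true_boundary)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up boundaries: the prediction is shifted by one pixel and has one outlier.
true_boundary = [(0, 0), (0, 1), (0, 2), (0, 3)]
pred_boundary = [(1, 0), (1, 1), (1, 2), (5, 5)]

print(bf_score(pred_boundary, true_boundary, tolerance=1.0))
print(bf_score(pred_boundary, true_boundary, tolerance=0.0))
```

With a tolerance of one pixel, three of the four predicted points and three
of the four true points find a match, so precision and recall are both 0.75.
With a tolerance of zero, nothing matches and the score drops to zero.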


Confusion Matrix: 


A confusion matrix is a table that describes the performance of the
segmentation on a set of test data for which the true values are known. Each
entry in a confusion matrix represents the number of predictions made by the
model in which the classes were correctly or incorrectly classified.
Figure 8.10 depicts the confusion matrix. The confusion matrix of ResNet18
is displayed in Table 8.5. The total loss rate, the false positive rate
(type I error), and the false negative rate (type II error) are determined
from the confusion matrices of all backbones and are displayed in Table 8.4.
ResNet18 has a loss rate of 0.4, which is lower than other backbones. There
is a 0.03 false negative rate as a result. ResNet18 has a high level of
accuracy. The following is a list of rates derived from the confusion
matrix:


1. Accuracy = (TP + TN)/total
2. Misclassification Rate = (FP + FN)/total
3. True Positive Rate = TP/Positive Value
4. False Positive Rate = FP/Negative Value
5. True Negative Rate = TN/Negative Value
6. Precision = TP/Positive Prediction
7. Prevalence = Positive Value/total


The accuracy gives the overall proportion of correct predictions. The
misclassification rate, or error rate, gives the overall proportion of wrong
predictions and is equal to 1 minus the accuracy. The true positive rate,
also known as "sensitivity" or "recall," indicates how often the prediction
is positive when the actual class is positive. The false positive rate
indicates how often the prediction is positive when the actual class is
negative. The true negative rate, also known as "specificity," indicates how
often the prediction is negative when the actual class is negative. It is
the same as 1 minus the false positive rate. Precision indicates how often a
positive prediction is correct. The term "prevalence" refers to how
frequently the positive class actually occurs in the data.
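

The rates listed above can be computed directly from the four cells of a
binary confusion matrix. The sketch below plugs in the ResNet18 pixel counts
quoted in this section, treating oil as the positive class, which is an
assumption made for illustration. The function name is hypothetical.

```python
def confusion_rates(tp, fp, fn, tn):
    """Rates derived from a 2 x 2 confusion matrix."""
    total = tp + fp + fn + tn
    actual_positive = tp + fn
    actual_negative = fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / actual_positive,
        "false_positive_rate": fp / actual_negative,
        "true_negative_rate": tn / actual_negative,
        "precision": tp / (tp + fp),
        "prevalence": actual_positive / total,
    }

# Oil as the positive class: 990,321 oil pixels found, 457,267 oil pixels
# missed, 146,139 background pixels labeled as oil, 8,758,554 background
# pixels labeled correctly.
rates = confusion_rates(tp=990_321, fp=146_139, fn=457_267, tn=8_758_554)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Two identities are a useful sanity check on any such table: accuracy and
misclassification rate sum to 1, as do the true negative rate and the false
positive rate.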


Figure 8.10 Confusion matrix (columns: true class, positive and negative;
rows: predicted class, positive and negative).

Figure 8.11 Performance measure comparisons of backbones of DeepLabV3+.


Table 8.3 Comparisons of backbones of DeepLabV3+


Method        Model        Accuracy    IoU    BF score
DeepLabV3+    ResNet18     96          93     67
DeepLabV3+    Xception     75          60     45
DeepLabV3+    MobileNet    88          78     45


Table 8.4 Comparison of loss rate of pre-trained model


Method        Model        Overall loss rate    False positive    False negative
DeepLabV3+    ResNet18     0.4                  0.12              0.03
DeepLabV3+    Xception     0.20                 0.18              0.04
DeepLabV3+    MobileNet    0.12                 0.12              0.25


Table 8.5 Confusion matrix of ResNet18 pre-trained model


Label         Oil        Background
Oil           990,321    457,267
Background    146,139    8,758,554

Figure 8.12 Comparisons of ResNet18 and Xception backbone architecture.
(a) Original image. (b) Ground truth image. (c) ResNet18. (d) Xception.
(e) MobileNet.

Figure 8.13 MobileNet pre-trained model training progress.

Figure 8.14 Xception pre-trained model training progress.

Figure 8.15 ResNet18 pre-trained model training progress.


8.5 Conclusion 


The proposed work annotates images from a SAR dataset. For every image in
the dataset, annotations are produced and displayed. An input image is
segmented to detect oil spills by semantic segmentation based on DeepLabV3+
with a ResNet18 backbone, and the detected objects are checked for
correctness against the input image. The proposed method provides a robust
methodology for extracting oil spills from images at low time complexity. We
also tested the Xception backbone network, but it takes a lot of time and
space to segment the image. The proposed method takes comparatively less
time than the other method to detect an object in the image. Pixel labels
assist in the modeling and annotation of data. Finally, to improve usability
in the realm of computer vision, the system can be connected with various
retrieval methods. The DeepLabV3+ backbones were compared, and ResNet18 was
shown to be superior to the others. The networks could distinguish features
of the oil from the background, with some misidentification of pixels due to
similar features. This work focused on accuracy only, without considering
training time and memory usage. When measuring semantic segmentation metrics
on the overall test set results, it was found that the DeepLabV3+ network
with ResNet18 performed better than the other networks, achieving 96%
overall accuracy, 93% IoU, and approximately 67% BF score (Table 8.3).

In the future, different ensemble approaches may be explored to improve the
accuracy of the system. More and better training images per semantic concept
may result in more stable models. Future work can be extended using an
extensive labeled dataset and hyperparameter fine-tuning to improve the
segmentation accuracy.


Abstract


Weather is a multifaceted, dynamic, and chaotic process that occurs in
real time. Forecasting and monitoring the weather is difficult due to these
characteristics. Wireless devices are crucial components, important not only
to industry for process control but also in everyday life for building
security, traffic flow monitoring, and the measurement of environmental
parameters. In weather monitoring, parameters such as temperature, humidity,
and pressure have to be measured, and sensors have consistently been given
this task. Data acquisition systems are very well known to both consumers
and industrial users. The proposed system has three sensors that measure the
parameters mentioned above, and a weather instrument is included for
rainfall detection and for wind direction and speed measurement. The stored
data are compared to the previous 60 years of weather data to predict future
weather using convolutional neural networks. Previously, meteorologists
employed a variety of approaches for weather prediction, ranging from simple
temperature, rain, air pollution, and moisture observations to complicated
computerized mathematical models. Convolutional neural networks (CNNs) are a
strong deep-learning-based data modeling tool for capturing and representing
complex input/output relationships. The real strength and advantage of
convolutional neural networks is their ability to model both linear and
nonlinear relationships directly from the data. The quality and performance
of these algorithms are evaluated experimentally in MATLAB 2013a. The
proposed system achieves 93.56% accuracy, 94.12% sensitivity, and 94.32%
precision. The convolutional neural network produced the most accurate
weather prediction when compared to existing techniques such as the support
vector machine and the decision tree. For the most part, the modeling
results show that reasonable forecast accuracy was attained by the proposed
system.


Keywords: Convolutional neural networks, weather prediction, temperature, 
moisture, rain fall detection, MATLAB 2013a. 


9.1 Introduction 


Weather simply refers to the state of the atmosphere at a specific location
and time. It is a process that is ongoing, data-intensive, complex, dynamic,
and chaotic. These characteristics make weather forecasting a difficult
task. Forecasting is the practice of making educated guesses about unknown
conditions based on historical evidence. Weather forecasting has been one of
the most difficult scientific and technological challenges of the past
century (Kalimuthu, S. et al., 2021). Making an accurate prediction is,
without doubt, one of the most difficult tasks that meteorologists face
around the world. Weather prediction has been one of the most interesting
and exciting fields since ancient times. Scientists have explored a variety
of approaches to forecast meteorological characteristics, some of which are
more accurate than others, as noted by Dehghanian, P. et al. (2018).


Figure 9.1 Weather forecasting features (Galanis, G., 2017).


Scientific weather forecasting, which involves predicting the state of the
atmosphere at a certain area, requires knowledge of meteorology. Human
weather forecasting is a good example of the need to make decisions in the
face of uncertainty. Weather predictions are often generated by gathering
quantitative data about the current condition of the atmosphere and using
scientific knowledge of atmospheric processes to estimate how the atmosphere
will evolve in the future, as described by Galanis, G. (2017). The value of
learning about the cognitive process in weather forecasting has been
recognized in recent years. Even though most human forecasters employ
approaches based on meteorology to deal with the challenges of the job, as
Gómez-Romero, J. et al. (2018) point out, forecasting the weather becomes a
task whose specifics can be quite personal.

Weather forecasting entails predicting how the current state of the
atmosphere will change in the future. Ground observations, observations from
ships, observations from aircraft, radiosondes, Doppler radar, and satellites
are all used to determine current weather conditions. This information is
delivered to meteorological centers, which collect, analyze, and present the
information in various charts, maps, and graphs. Using modern high-speed
computers, thousands of observations are plotted onto surface and upper-air
maps. These maps are then used to forecast the weather in the future.
Weather forecasting employs a variety of methodologies, ranging from simple
sky observation to highly complicated computerized mathematical models, as
described by Ukhurebor, K. E. et al. (2017). Weather forecasts can be made
for one day, one week, or several months in advance. Weather forecasts, on
the other hand, lose a lot of accuracy after a week. Due to the chaotic and
unpredictable nature of weather, forecasting remains a difficult business.
It is still a procedure that is neither entirely scientific nor entirely
artistic. Wilgan, K. et al. (2017) illustrate how people with little or no
formal instruction can gain significant forecasting skills. Farmers, for
example, are typically quite capable of producing their own short-term
forecasts of those meteorological conditions that directly affect their
livelihood, while pilots, anglers, and mountain climbers are similarly
skilled. Weather events, which are typically complicated, have a direct
impact on the safety and economic stability of such people. Accurate weather
forecast models are critical in developing countries, where agriculture is
entirely dependent on the weather. Identifying any tendency of weather
parameters to depart from their periodicity, which would damage the
country's economy, is therefore a serious concern. The threat of global
warming and the greenhouse effect has heightened this concern (Schumacher,
R. S., 2017). Extreme weather events are becoming increasingly costly to
society, inflicting infrastructure damage, injury, and even death (Singh, M.
et al., 2021).

Weather forecasting, as conducted by professionally educated meteorologists,
is now a highly developed skill based on scientific principles and methods
and supported by advanced technical tools. Since 1950, technological
advancements, basic and applied research, and the application of new
information and procedures by weather forecasters have resulted in a
significant improvement in forecast accuracy, as Pierce, F. J., & Lal, R.
(2017) demonstrate. High-speed computers, meteorological satellites, and
weather radars are examples of tools that have helped improve weather
forecasting. A number of other factors have also contributed to the
improvement in forecast accuracy. One is the enhanced observational
capability provided by meteorological satellites. Another is the ongoing
improvement of the initial conditions prepared for the forecast models
(Kalimuthu, S., 2021).

Statistical approaches can predict a wider range of meteorological
variables than models alone can, and they can tailor the less precise model
forecasts to specific locations. On a worldwide scale, satellites now allow
for practically continuous viewing and remote sensing of the atmosphere. An
increase in the quantity of observations and greater use of the observations
in computational approaches have resulted in improved initial conditions.


9.1.1 Types of Weather Forecasting 


A daily weather forecast is made possible by the contributions of thousands
of observers and meteorologists from all around the world. Cloud images are
captured from space by weather satellites orbiting the planet, and modern
computers make forecasts more precise than ever. Forecasters base their
predictions on observations from the ground and from space, as well as on
algorithms and rules derived from historical experience. Meteorologists
create daily weather forecasts by combining several distinct methodologies
(Maleki, H. et al., 2019). They are as follows.


a) Computer forecasting 
b) Synoptic forecasting 
c) Persistence forecasting 
d) Statistical forecasting 


9.1.1.1 Computer Forecasting 

Forecasters feed their observations as numbers into complex equations.
These equations are run on several ultra-high-speed computers to create
computer "models" that provide a forecast for the following several days
(Poterjoy, J. et al., 2019). Because different equations often produce
different outcomes, meteorologists must always combine this strategy with
other forecasting methods (Khandakar, A. et al., 2019).


9.1.1.2 Synoptic Forecasting 

This strategy applies the basic rules of forecasting. Meteorologists use
their observations and the rules they have learned to generate a forecast
for the next few days (Kang, G. K. et al., 2018).


9.1.1.3 Persistence Forecasting 

Persistence forecasting is the most basic approach to weather prediction. It
assumes that the weather will continue to behave as it does at present, and
meteorologists take weather observations to determine how the weather is
currently behaving. When the weather is stable, such as during the summer
season in the tropics, this can be a good technique for forecasting the
weather. The presence of a stationary weather pattern is critical for this
type of forecasting (Kumar, K. R., 2018). It can be used for both short- and
long-term forecasts.
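

Persistence forecasting is simple enough to state in a few lines of Python.
The sketch below repeats the latest observation as the forecast and compares
the error in a stable spell with the error when conditions change. The
temperatures are made up for illustration, and the function names are
hypothetical.

```python
def persistence_forecast(observations, horizon):
    """Repeat the most recent observation for each of the next `horizon` steps."""
    if not observations:
        raise ValueError("at least one observation is required")
    return [observations[-1]] * horizon

def mean_absolute_error(forecast, actual):
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)

# Made-up daily temperatures (degrees Celsius) following today's 31.2 reading.
stable_spell = [31.0, 31.5, 31.2, 31.4]
front_passing = [31.0, 27.0, 22.5, 18.0]

forecast = persistence_forecast([30.8, 31.2], horizon=4)
error_stable = mean_absolute_error(forecast, stable_spell)
error_changing = mean_absolute_error(forecast, front_passing)
print(forecast)
print(error_stable, error_changing)
```

The forecast is accurate while the pattern is stationary and degrades quickly
once it changes, which is why the method depends on a stable weather pattern.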


9.1.1.4 Statistical Forecasting 

Meteorologists ask what the weather is usually like at this time of year.
Forecasters can gain a sense of what the weather is "supposed to be like" at
a certain time of year by looking at historical records of rainfall,
snowfall, and typical temperatures (Hosseini, S. M., & Mahjouri, N., 2018).

The remainder of this chapter is organized as follows. Section 9.2 discusses
past weather predictions for various datasets and related work, Section 9.3
discusses the proposed feature selection mechanism with a convolutional
neural network, and Section 9.4 compares the experimental outcomes of CNNs
and existing systems. Finally, Section 9.5 contains the concluding thoughts
and future scope of the work.


9.2 Literature Review 


For the scientific community, accurate weather forecasting is a big
challenge. Computer models, observation, and knowledge of trends and
patterns are all used in weather prediction modeling. Different weather
forecasting methods can be combined with these to obtain reasonably accurate
forecasts. They are reviewed below. For 24-hour weather forecasting,
Zubaidi, S. L. et al. (2020) used soft computing models based on the radial
basis function network. In comparison to the multilayer perceptron network,
they found that radial basis function neural networks give the most accurate
forecasts.

As-syakur, A. R. et al. (2019) showed that the ANN model can be used as a
suitable forecasting tool for rainfall prediction, outperforming the
autoregressive integrated moving average (ARIMA) model. Erickson et al.
(2018) also employed artificial intelligence algorithms to estimate regional
rainfall, and they found that this technique has a reasonable level of
accuracy for monthly and seasonal forecasts. Huntingford, C. et al. (2019)
provided a method for classifying and forecasting future weather using a
back-propagation algorithm, as well as a discussion of previous weather
forecasting models. Finally, their study shows that new wireless
communication technology can be utilized in the weather forecasting process.

Waliser, D. et al. (2018) demonstrate a weather prediction program using
support vector machines. Using optimal kernel values, the system's
performance is measured over time periods ranging from 2 to 10 days. Using
ten years of meteorological data (1996-2006), Verbois, H. et al. (2018)
investigated artificial neural networks built on multilayer perceptrons. The
findings suggest that the multilayer perceptron network has the lowest
prediction error and can be used to construct short-term temperature
forecasting systems.

Yemane, S. et al. (2021) describe a weather prediction application using
a back-propagation neural network. A real-time dataset is used to test their
proposal. The results were compared to the actual work of the meteorological
department, and they confirmed that, for real-time weather data,
back-propagation network-based weather forecasts outperform not only
numerical model guidance forecasts but also official local weather service
forecasts. Talavera, J. M., et al. (2017) developed a feature-based neural
network model for weather forecasting.

To forecast maximum temperature, relative humidity, and minimum temperature,
this model employs a feed forward artificial neural network (FFANN) with
back propagation for supervised learning. The trained artificial neural
network was used to forecast future weather conditions. A feed forward
neural network (FNN) was utilized by Blair, G. S. et al. (2019) to predict
typhoon rainfall. The FNN was used to estimate the residuals of the linear
model, that is, the differences between the predicted rainfall and the data
from a typhoon rainfall or snowfall climatology model, and the findings were
good.

One of the most popular supervised training methods is the back-propagation
neural network (BPNN). Training commonly proceeds by iteratively updating the
weights so as to minimize the mean square error. The error signal is
transmitted back to the lower layers, and the steepest descent algorithm
updates the network’s weights: each weight is adjusted in the direction in
which the error decreases most rapidly. The activation function for the
back-propagation algorithm must be continuous and differentiable (Qing, X., &
Niu, Y., 2018).

The most widely used learning method for feed forward neural networks 
is back propagation. In terms of information flow direction, the feed forward 
neural network is the simplest ANN architecture. The back-propagation tech- 
nique can be implemented in two different ways: batch updating and online 
updating. The batch updating method, like the standard gradient method, 
accumulates the weight adjustment across all training samples before con- 
ducting the update. The online updating strategy, on the other hand, adjusts 
the network weights instantly after each training sample is fed (Fu, M. et al., 
2015). 
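
The difference between the two update modes can be made concrete with a small sketch. The example below is illustrative only and is not taken from the cited works: it assumes a single linear unit trained on a squared-error loss, and the function names, the toy data, and the learning rate are all hypothetical choices.

```python
import numpy as np

def batch_update(w, X, d, lr):
    """One epoch of batch updating: accumulate the weight adjustment over
    all training samples, then apply it once."""
    adjustment = np.zeros_like(w)
    for x, target in zip(X, d):
        error = target - x @ w          # output error of a linear unit
        adjustment += error * x         # accumulate, do not update yet
    return w + lr * adjustment / len(X)

def online_update(w, X, d, lr):
    """One epoch of online updating: adjust the weights immediately after
    each training sample is presented."""
    w = w.copy()
    for x, target in zip(X, d):
        error = target - x @ w
        w += lr * error * x             # update right away
    return w

# Toy data (assumed): targets generated by the weights [1, 2].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([1.0, 2.0, 3.0])

w_batch = np.zeros(2)
w_online = np.zeros(2)
for _ in range(500):
    w_batch = batch_update(w_batch, X, d, lr=0.5)
    w_online = online_update(w_online, X, d, lr=0.5)
```

On this consistent toy problem both modes arrive at the same weights; they differ in how often the weights change within an epoch, which is the distinction drawn in the text.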

The feed forward neural network also serves as a building block for many
other neural network topologies. Zhao, H. et al. (2019) draw the same
distinction between the two ways of implementing back propagation: the batch
updating strategy, like the classic gradient method, accumulates the weight
correction across all training samples before conducting the update, whereas
the online updating strategy updates the network weights immediately after
each training sample is fed in.

The vast majority of artificial neural network systems are trained with
supervision. In supervised learning, the artificial neural network must be
trained before it can be used. The network is trained using paired input and
output data, and this collection of data is called the training set. In this
mode, the actual output of the neural network is compared to the desired
output. In the next iteration, or cycle, the network adjusts the weights,
which are generally set randomly at first, so that the actual and desired
outputs are closer. The learning method tries to reduce the current errors of
all processing elements. This global error is reduced over time by adjusting
the input weights until the network accuracy is adequate (Kumar, Y. J. N.
et al., 2020).

Unsupervised learning has a bright future ahead of it. It demonstrates how,
in the future, computers may be able to learn on their own in a robotic
sense. Self-supervised learning is the name given to this promising field of
unsupervised learning. No external signal influences the weights of these
networks; instead, the networks monitor their own performance. They search
for patterns or trends in the input signals and adjust the network’s function
accordingly. Even though it is not told whether it is correct or incorrect,
the network must have some knowledge of how to organize itself. The network
topology and learning rules contain this information. Unsupervised learning
is still a research topic because it is not well understood (Mihai, A. et
al., 2019). The following section discusses the proposed convolution-based
weather prediction approach, which is intended to overcome the
above-mentioned drawbacks and to improve performance and running time.


9.3 System Design 


A neural network (NN) is a common data modeling technique for capturing and
representing intricate input/output relationships. The goal of designing an
artificial system capable of performing cognitive tasks similar to those done
by the human brain fueled the creation of neural networks (Barton, T., &
Musilek, P., 2019). A layered neural network’s neurons are organized into
layers. In the simplest version of a layered network, an input layer of
source nodes projects onto an output layer of neurons, but not the other way
around (da Silva Fonseca Jr., J. G. et al., 2012). Such a network is a
single-layer feed forward network: it has one input layer and one output
layer, and the input layer is not counted as a layer because it performs no
mathematical operations.


Figure 9.2 Multilayer feed forward network architecture.

A feed forward network, to put it another way, is one in which data can only
travel from the input layer to the hidden layers, and then to the output
layer. In this type of network, there are no feedback links (Liu, Y. et al.,
2016). Figure 9.2 depicts the architecture of a multilayer feed forward
network. The hidden neurons’ job is to add processing between the input and
output layers. This improves the accuracy of the network, allowing it to
handle more difficult tasks. By adding more hidden layers, the network can
analyze more weather data and extract higher-order statistics. The input
signal is sent to the neurons in the second layer. The second layer’s output
signal is sent into the third layer, and so on (Xu, G. et al., 2019).
Figure 9.3 shows the input features and output predictions for the UCI
rainfall prediction dataset using the CNN hidden layer architecture. The
proposed system architecture is depicted in Figure 9.4, along with a
comparison of the system flow for the SVM and decision tree algorithm
architectures.
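
The layer-by-layer signal flow described above can be sketched in a few lines. The following is a minimal illustration rather than the chapter's implementation: the layer sizes, the sigmoid activation, the random weights, and the function names are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate an input vector through a list of (W, b) layers and return
    the activations of every layer, input included. Data only moves forward;
    there are no feedback links."""
    activations = [x]
    for W, b in layers:
        x = sigmoid(W @ x + b)      # output of this layer feeds the next
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
sizes = [14, 8, 4, 1]               # assumed: 14 inputs, two hidden layers, 1 output
layers = [
    (rng.normal(0.0, 0.5, (n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.random(14)                  # one hypothetical input pattern
acts = forward(x, layers)
```

Here `acts[-1]` holds the single output activation, a value between 0 and 1 that could be read as a rain/no-rain score under these assumptions.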
Figure 9.3 General structures for weather forecasting system using CNNs.


Back-propagation learning is the process of encoding an input-output
relationship, represented by a set of examples (x, d), in a back-propagation
network that has been trained well enough to generalize to future inputs. The
same network can be trained multiple times, with each training run yielding
distinct synaptic weights. Cross-validation, a standard statistical method,
serves as a guiding principle. The supplied dataset is divided at random into
two groups: training and testing. The training set is then further separated
into two parts (Wen, J. et al., 2020). The majority of meteorological systems
are characterized by temporal and spatial variability, as well as by
nonlinearity of the physical processes, conflicting spatial and temporal
scales, and uncertainty in parameter estimation. Neural networks can extract
the relationship between a process’s inputs and outputs even if the physics
is not explicitly stated. As a result, neural networks are well suited to the
weather forecasting problem at hand. The back-propagation learning algorithm
has two phases: propagation and weight update (Yoo, C. et al., 2019).


Phase 1: Propagation 


The following steps are involved in each propagation:


1. Forward propagate a training pattern’s input through the neural network
to generate the propagation’s output activations.

2. Using the training pattern’s target, back propagate the output
activations through the neural network to generate the deltas of all output
and hidden neurons.


Figure 9.4 General framework of the proposed study.


Phase 2: Weight Update 


For each weight-synapse: 


1. To calculate the weight’s gradient, multiply its input activation by its
output delta.

2. Subtract a ratio of the gradient from the weight, which moves the weight
against the gradient’s direction.

This ratio is the learning rate, a fraction that influences the speed and
quality of learning. Because the sign of a weight’s gradient indicates the
direction in which the error is rising, the weight must be updated in the
opposite direction. The first and second phases are repeated until the
network’s performance is acceptable. The steps in back propagation are listed
as Steps 1 to 9 later in this section. Figure 9.3 shows a fully connected
feed forward back-propagation neural network, in which each neuron in a layer
is connected to every neuron in the layer before it. The network’s signal
flows in a forward direction, layer by layer, from left to right.

Two distinct passes of computation can be distinguished in the
back-propagation process: the forward pass and the backward pass. In the
forward pass, the synaptic weights are left unchanged, and the network’s
function signals are computed neuron by neuron, as shown in Figure 9.3. The
forward computation phase thus starts with the delivery of the input vector
to the first hidden layer and ends with the computation of the error signal
for each neuron in the output layer. The backward pass, on the other hand,
begins at the output layer and recursively computes the local gradient for
each neuron by passing error signals layer by layer leftward through the
sensitivity network. This recursive process allows the network’s synaptic
weights to change in accordance with the delta rule. A neuron’s local
gradient at the output layer is just its error signal multiplied by the first
derivative of its nonlinearity. The recursive computation is repeated layer
by layer, propagating the changes to all synaptic weights. Back propagation
is thus an iterative process that begins at the last layer and works its way
back until it reaches the first layer. Assume that the output error is known
for each layer. When the output error is known, calculating changes to the
weights to reduce the error is simple. The problem is that only the error in
the final layer’s output can be observed. Back propagation determines the
error in a previous layer’s output from the error in the output of the
current layer. As a result, the procedure is iterative, beginning with the
last layer and computing weight changes.

The steps in back propagation are as follows:

Step 1. Initialize the weights in the network (often randomly)

Step 2. Do

Step 3. For each e in the training set

    a. O = neural-net-output(network, e); forward pass

    b. T = teacher output for e

Step 4. Calculate error (T - O) at the output units

Step 5. Compute Δwh for all weights from hidden layer to output layer;
backward pass

Step 6. Compute Δwi for all weights from input layer to hidden layer;
backward pass continued

Step 7. Update the weights in the network

Step 8. Until all e’s are classified correctly or stopping criterion
satisfied

Step 9. Return the network
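
Steps 1 to 9 can be turned into a short runnable sketch. The version below is an illustration under stated assumptions, not the implementation used in this study: it assumes one hidden layer, sigmoid units, a squared-error loss, per-example updates, and a fixed number of epochs as the stopping criterion; the XOR data and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, T, n_hidden=4, lr=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial weights (last column of each matrix is the bias)
    W_h = rng.normal(0.0, 1.0, (n_hidden, X.shape[1] + 1))
    W_o = rng.normal(0.0, 1.0, (1, n_hidden + 1))
    for _ in range(epochs):                    # Steps 2 and 8: fixed epoch budget
        for x, t in zip(X, T):                 # Step 3: each example e
            x1 = np.append(x, 1.0)
            h = sigmoid(W_h @ x1)              # forward pass, hidden layer
            h1 = np.append(h, 1.0)
            o = sigmoid(W_o @ h1)              # forward pass, output O
            err = t - o                        # Step 4: error (T - O)
            # Local gradient: error signal times derivative of the sigmoid
            delta_o = err * o * (1.0 - o)
            # Hidden deltas come from the output deltas (backward pass)
            delta_h = (W_o[:, :-1].T @ delta_o) * h * (1.0 - h)
            W_o += lr * np.outer(delta_o, h1)  # Steps 5 and 7
            W_h += lr * np.outer(delta_h, x1)  # Steps 6 and 7
    return W_h, W_o                            # Step 9

def predict(X, W_h, W_o):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    H1 = np.hstack([sigmoid(X1 @ W_h.T), np.ones((len(X), 1))])
    return sigmoid(H1 @ W_o.T).ravel()

# Hypothetical toy problem: XOR, which needs the hidden layer.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([0.0, 1.0, 1.0, 0.0])

W_h0, W_o0 = train_backprop(X, T, epochs=0)    # untrained network
W_h, W_o = train_backprop(X, T)                # trained network
loss_before = np.mean((T - predict(X, W_h0, W_o0)) ** 2)
loss_after = np.mean((T - predict(X, W_h, W_o)) ** 2)
```

Note that the hidden-layer deltas are computed from the output-layer deltas before any weight is changed, which mirrors the statement that the error in a previous layer is obtained from the error in the current layer.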


9.4 Result and Discussion 


A BPNN, trained on the training set provided to the NN, is utilized to
predict weather. The results demonstrate that an intelligent system can be
effectively integrated with NN-based weather data prediction to classify rain
and no-rain conditions. The training method is a simple conjugate gradient
method, which improves convergence. The back-propagation neural network
approach to weather forecasting can produce good results and can be used
instead of established meteorological approaches. This method can capture the
non-linear relationship in the historical data fed into the system during the
training phase and use it to predict future weather.


9.4.1 Dataset 


The Indian Meteorological Department of Tamil Nadu provided weather data 
for ten years (2001-2020). The weather data is divided into two groups: the 
training group, which accounts for 70% of the data, and the test group, which 
accounts for 30% of the data. Today’s weather forecasts rely on gathering and
interpreting data and measurements from all across the globe. Weather.com
and AccuWeather.com provided some of the misclassified data. Rather than
giving ordinary users the opportunity to modify and interactively discover
potential concerns linked with imminent weather hazards, it assisted
meteorologists in studying and projecting personalized weather forecasts for
a city or metropolitan area. There are 14 attributes in the data collection
(Sheikh, F. et al., 2016). They are as follows:


• Bar reading
• Wind direction
• Mean sea level pressure
• Maximum temperature
• Dry bulb temperature
• Wind speed
• Minimum temperature
• Vapor pressure
• Cloudiness
• Bar temperature
• Relative humidity
• Precipitation
• Wet bulb temperature
• Station level pressure
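
The 70%/30% division into training and test groups can be sketched as follows. This is a minimal illustration, not the procedure actually used for the meteorological data: the random permutation, the synthetic array of 520 samples, the choice of 13 input columns with a separate binary label, and the function name are all assumptions.

```python
import numpy as np

def split_train_test(X, y, train_fraction=0.7, seed=0):
    """Shuffle the sample indices once, then cut them into two groups."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in data (assumed): 520 samples, 13 input attributes,
# and a rain (1) / no-rain (0) label.
data_rng = np.random.default_rng(1)
X = data_rng.random((520, 13))
y = (data_rng.random(520) > 0.5).astype(int)

X_train, X_test, y_train, y_test = split_train_test(X, y)
```

For data with a strong temporal order, a chronological cut is a common alternative to a random shuffle; which of the two was used here is not stated in the text.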


A confusion matrix is a table that lists the actual and predicted categories
in a classification system. The matrix data is commonly used to assess the
performance of such systems. The accuracy is the proportion of correct
predictions in the total number of predictions. The true positive rate, also
known as the recall rate or sensitivity, is the proportion of positive
instances that are correctly identified. The proportion of correctly
identified negative instances is known as the true negative rate, or
specificity. The false negative rate is the proportion of positive instances
that were incorrectly categorized as negative. Finally, precision (P) is the
proportion of predicted positive cases that are actually positive. Four
classic performance criteria, namely sensitivity, accuracy, specificity, and
precision, are used to validate all trials. The performance metrics (Rajesh,
P., & Karthikeyan, M., 2017) are defined in terms of the numbers of true
positives (TP), false positives (FP), true negatives (TN), and false
negatives (FN) as follows:


$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{FP + TN}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
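
As a worked illustration of these four definitions, the sketch below computes them from raw counts. The function name and the counts passed to it are hypothetical sample values chosen for easy arithmetic; they are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, sensitivity, specificity, and accuracy from the four
    cells of a two-class confusion matrix."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts: 100 positive and 100 negative cases.
m = classification_metrics(tp=90, fp=5, fn=10, tn=95)
```

With these sample counts, sensitivity is 90/100, specificity is 95/100, and accuracy is 185/200, which shows how a classifier can have different sensitivity and specificity even when the classes are balanced.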


The confusion matrix in Table 9.1 can be used to calculate classification
statistics for the back-propagation network on the given weather patterns.
Figure 9.5 shows a performance analysis of weather forecasting.


Table 9.1 Confusion matrix results of CNNs in weather prediction

             Rain    No-rain    Total
Rain         192     68         260
No-rain      42      218        260
Total        260     260


Figure 9.5 Performance analyses of CNNs.


Receiver operating characteristic: Aside from confusion matrices, another way
of evaluating classifier effectiveness is to use receiver operating
characteristic (ROC) graphs. In an ROC graph, the false positive rate is
plotted on the X-axis and the true positive rate on the Y-axis. The point
(0, 1) is the perfect classifier: it correctly categorizes all positive and
negative cases, because the false positive rate is 0 (none) and the true
positive rate is 1 (all). A classifier at point (0, 0) predicts all cases to
be negative, while a classifier at point (1, 1) predicts all cases to be
positive. The classifier at point (1, 0) is wrong in all classifications. In
many cases, a classifier has a parameter that can be adjusted to increase TP
at the expense of an increased FP, or to decrease FP at the expense of a
decreased TP. Each parameter setting yields an (FP, TP) pair, and an ROC
curve can be produced from a sequence of these pairs. A non-parametric
classifier is represented by a single ROC point, corresponding to its
(FP, TP) pair.
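
The way a sequence of parameter settings traces out an ROC curve can be illustrated with a decision threshold. The sketch below is an assumption-laden example, not the procedure behind Figure 9.6: the scores, labels, thresholds, and function name are all hypothetical.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Return one (false positive rate, true positive rate) pair per
    threshold; a case is predicted positive when its score >= threshold."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    points = []
    for th in thresholds:
        predicted_pos = scores >= th
        tp = np.sum(predicted_pos & (labels == 1))
        fp = np.sum(predicted_pos & (labels == 0))
        points.append((float(fp / n_neg), float(tp / n_pos)))
    return points

# Hypothetical classifier scores and true labels (1 = rain, 0 = no-rain).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

pts = roc_points(scores, labels, thresholds=[1.1, 0.5, 0.0])
```

The highest threshold predicts every case negative and gives the point (0, 0), the lowest predicts every case positive and gives (1, 1), and intermediate thresholds fill in the curve between them.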

For classification, all available rain and no-rain features are used. In
Table 9.2, the validation parameters accuracy, sensitivity, and precision are
93.56, 94.12, and 94.32, respectively. The 42 misclassified samples here are
related to rain features, whereas the 46 misclassified samples are related to
no-rain features. The training time for the network is 18.46 seconds.
Figures 9.6 and 9.7 illustrate the receiver operating characteristic (ROC)
curve for this investigation and the best validation performance of 0.001 at
141 epochs.


Figure 9.6 ROC curve for CNNs.


Figure 9.7 Validation performances for CNNs.


Table 9.2 and Figure 9.8 show the performance evaluation results. The CNNs
achieve the best results, with accuracy, sensitivity, and precision of 93.56,
94.12, and 94.32, respectively. The conventional approaches score lower: the
support vector machine achieves 84.89, 87.65, and 88.17, and the decision
tree algorithm achieves 81.23, 82.22, and 81.98.


Table 9.2 Performance comparison results

                 Accuracy    Sensitivity    Precision
CNNs             93.56       94.12          94.32
SVM              84.89       87.65          88.17
Decision tree    81.23       82.22          81.98


Figure 9.8 Performance comparison results.


9.5 Conclusion 


Accurate weather forecasting is essential for day-to-day activity planning.
Neural networks have been used in many real-time applications, including
weather forecasting, and their simplicity and robustness have made them very
popular in weather prediction. In this study, a neural network model was
constructed from numerous factors acquired from the meteorological department
and employed to forecast weather for Tamil Nadu, India. The advantage of
neural networks lies in their computational speed and their capability of
adapting to changing information. This research has focused on the
application of artificial neural networks to weather forecast classification,
which helps in predicting future weather. Online training, fault analysis
system design, error prediction and removal, probability analysis, and
nonlinear equalization are some of the future uses of the proposed weather
prediction approach. Longer-term data, covering around 20 or more years, may
also be used. Other classifier techniques, such as statistical methods and
genetic algorithms, can be applied in future studies.
Abstract


Higher educational systems are accountable for ensuring effective E-learning
(EL) environments for online learners. An effective learning environment
engages learners in educational activities. The chapter has three parts. In
the first part, we discuss the convolutional neural network (CNN). CNN has
many models, but for the purpose of this study, we applied three models that
we found most appropriate for measuring students’ engagement (SEt) in EL
assignments. We applied the all-convolutional network (All-CNN),
network-in-network (NiN-CNN), and very deep convolutional network (VD-CNN)
because they have simple network architectures and show efficiency in
conditions and categories. These categories are based on the conditions of
learners as shown by their facial expressions in an online environment. In
the second part of the chapter, we cover the methods of application and
benefits of machine learning (ML) and artificial intelligence (AI) in King
Khalid University’s (KKU) E-learning. The third part of the chapter covers
the role of the Internet of Things (IoT) in the education sector and defines
the advantages, types of security concerns, and challenges of deployment of
IoT. This research is descriptive in nature; results for the application of
the three CNN models are reported in terms of their advantages and challenges
for online learners, and results for ML and AI are based on qualitative
analysis done through tools and techniques of EL applied in KKU’s learning
management services (LMS) and Blackboard (BB). Results for IoT show the
benefits for both students and educators.


Keywords: Convolutional neural network, artificial intelligence, Internet of 
Things, E-learning, machine learning. 


10.1 Introduction 


The past decade has witnessed rapid development in EL, including the adoption
of several web-based learning analytics tools [1]. EL has made individual and
group learning easier and has lowered costs [2]. It serves all types of
target groups, offering advantages in higher education and professional
environments and adding to learners’ expertise in their knowledge areas [3].
EL also has disadvantages in comparison to face-to-face (F2F) learning. In an
F2F environment, instructors have ample opportunity to observe students’
behavior, interests, physical gestures, and facial expressions. These cues
help instructors gauge students’ participation in learning, but they are
largely unavailable in the EL environment [4]. This limitation makes it
difficult to evaluate SEt in learning, completing assignments, and other
related educational endeavors. In this situation, the CNN offers a great
advantage. A machine- or technology-based solution has also been suggested
for identifying students’ involvement in various ways, such as following
their mannerisms, the time they devote to online learning, and the questions
they ask and answer during the online learning session. Monitoring makes
students more attentive and motivates them to become involved in the learning
process. This is one good option for reducing stress and anxiety amongst
students, and it can eventually reduce the number of students who withdraw.
Learning that is based on competence and critical thinking requires more
involvement at an internal level as well as an external level. SEt can be
measured at these internal and external levels with the help of CNN. The
internal and external features include perception, observable facial
description, physical gesture, verbal communication, and conduct [5]. It is
difficult to measure internal involvement. The externally apparent factors,
however, can be assessed using new sensing and affective computing techniques
based on video [6], audio [7], and physiological signals [9]. Students’
emotional states can be inferred from these measures via affective computing,
which is increasingly being used in learning technologies [10]; however,
these techniques have not yet been widely applied in online learning. Facial
image analysis is the widely accepted approach for identifying facial
gestures and emotional states. Facial features represent emotional conditions
such as awkwardness, antipathy, and contentment, which play an important role
in students’ expressions of frustration and involvement during education.
Good progress has been made, but many limitations remain in appropriately
mapping facial expressions to students’ emotions and involvement. Deep
learning techniques (DPT) have brought great progress in computer vision but
have so far seen little use in measuring SEt in EL. In this chapter, we
examine the applicability of the CNN, a DPT, for SEt detection in EL.
Specifically, three popular models of CNN and a proposed model are tested for
SEt recognition using facial expressions. The three popular models are
All-CNN [11], NiN-CNN [12], and VD-CNN [13]. We observed that each of these
architectures has some unique characteristics that can help to increase the
accuracy of a typical CNN model: applying multilayer perceptrons, increasing
the depth of the network by using small convolutional filters, and replacing
some max-pooling layers with convolutional layers that have an increased
stride. We evaluated the accuracy of the above three models and the proposed
model in estimating students’ perceived SEt (external observable factors,
such as facial expressions) in online learning. Since teachers rely on
external observable factors to judge perceived SEt and adapt their teaching
behavior, the automation of perceived SEt identification is likely to be
useful in EL for offering special help to the students in need.

EL is witnessing growth as never before, from the educational sector to the
commercial environment, but without ML and AI, this growth would never have
been possible. The introduction of ML and AI has bridged gaps in
communication [14]. ML and AI are shaping the future growth of EL by using
prediction techniques and algorithms to develop more individual-centric EL
practices. This chapter attempts to describe the methods and processes used
to achieve this objective. As an example, the KKU EL deanship is used to
observe the role of ML and AI in EL growth. The chapter also considers the
changes that further advances in ML and AI may bring to EL. This discussion
is segmented into three parts. The first part presents the literature review,
where a description and introduction of EL in KKU is given specifically,
along with the general historical view of EL and the growth of ML and AI from
past to present. The second part discusses various areas of ML and AI in
general and in the context of EL in KKU specifically, as part of the
educational sector; it covers the definition of ML and AI, ML classification,
benefits of ML and AI in EL, EL transformation due to ML, and AI and ML
application. All these areas are discussed with reference to the KKU EL
deanship. The last part provides the qualitative results on expected changes
in EL in KKU after the application and determination of ML and AI [15, 16].
Besides the above-mentioned three parts, a short note on predictive
limitations is also given, which is based entirely on the researchers’
experiences with ML and AI in the EL environment.


Figure 10.1 IoT applications in education [20].

The Internet of Things (IoT) has the capability to transform education by
profoundly altering how educational institutions collect data, interface with
users, and automate processes [17]. IoT applications involve the networking
of physical objects through the use of embedded sensors, actuators, and other
devices. This network collects and transmits information about real-time
campus activity in the educational sector [18]. The implementation of IoT
creates a new learning environment because it is able to integrate users’
mobility and data analytics. Figure 10.1 shows the tasks that can be
performed by IoT [19].

This chapter contributes by explaining the role of emerging technologies in
the education sector and by illustrating guidelines for, as well as
advantages of, their applications, which can be followed by similar
institutions [21]. Figure 10.2 lists the main components of IoT that can be
used in the educational sector.


Figure 10.2 Components of IoT [22].


CDS are the primary physical devices which are connected to the educational
learning system. These devices establish the connectivity of all the
equipment necessary for the learning operations. CC is also a device, but its
job is to manage the data, ensure security, follow protocols, and manage
traffic on the network. CC also scales the hardware as per the requests of
clients [23]. These clients are students, faculty, IT professionals, or other
university staff. DC stores the learning material and other useful data of
university students and employees. IoT provides a UI for the education
clients for two-way communication. A major issue in using IoT is security,
but SS helps in solving this problem and assures academicians that they can
use the system without apprehension. Finally, DA provides analysis of data in
a systematic manner. DA is an example of computational analysis which can be
applied for various benefits in the education system, such as data
interpretation and other decision-making processes for learning and teaching
[24].


10.2 Literature Review 


At a high level, SEt identification methods can be divided into three main
categories: manual, semi-automatic, and automatic [25]. The manual category
refers to the methods where learners’ direct involvement is needed in SEt
detection. This category includes observational checklists and self-reporting
techniques. The observational checklist relies on questions completed by
external observers (e.g., teachers) instead of the learners. Self-reporting
poses a set of questions in which students report their own level of
attention, distraction, excitement, or boredom [26]. Self-reporting is of
great interest to many researchers because it is easy to administer and can
provide some useful insight into students’ SEt [27]. However, its validity
depends on a number of factors, such as learners’ honesty, their willingness
to report their emotions, and the accuracy of learners’ perception of their
emotions [28]. The
semi-automatic category includes the methods of SEt tracing. The SEt tracing 
utilizes the timing and accuracy of learner responses to practice problems and 
test questions [30]. Although these methods have been used in classroom- 
based learning and intelligent tutoring systems, not many applications of 
these methods can be found in online learning [31]. The automatic category
includes computer-vision-based methods, sensor data analysis, and log-file
analysis. Among these, the computer-vision-based methods are found to be more
suitable for online learning, as they are unobtrusive to the users and the
hardware and software needed to capture and analyze the data are widely
available at low cost. Typical computer-vision-based methods use facial
expressions, gestures and posture, and eye movement. Some research studies
combine more than one modality to achieve better accuracy. A good deal of
information used by humans to make SEt judgment
is based on human faces, and it has been hypothesized that facial expressions 
are directly linked to the perceived SEt [32]. Cameras provide a continuous 
and non-intrusive way of capturing face images when the learner uses a 
mobile device or a personal computer for learning activities. The captured 
facial information is used to understand certain facets of the learner’s current 
state of mind. Many different methods have been proposed to automate this 
detection process by analyzing the face images [33]. Based on how the 
information from a face appearance is used, these methods are divided into 
two groups: part-based and appearance-based [34]. Part-based methods deal 
with different parts of a face (e.g., eyes, mouth, nose, forehead, chin, and 
so on) for the SEt detection. A comprehensive way to analyze the parts of a 
face is the Facial Action Coding System (FACS) [35]. In appearance-based 
methods, features from whole-face regions are extracted and used to generate 
patterns for SEt classification. Among different feature extraction techniques, 
local binary patterns (LBP) and histogram of oriented gradients (HOG) are 
found to be popular in appearance-based methods [36]. SEt detection using
CNN relies on a generic CNN architecture, which combines deep learning
technology with ANNs. A generic CNN architecture typically contains an
input layer, multiple hidden layers, and an output layer. In the hidden
layers, there can be different combinations of convolutional layers,
activation layers, pooling layers, normalization layers, and fully connected
layers. Figure 10.3 shows a typical architecture of CNN as presented in [35],
where the input, convolutional, pooling, fully connected, and output layers
are illustrated. Each feature map can detect the presence of a single
feature at all possible locations. Each output of the convolutional layer
is then passed through an activation layer, which uses an activation function
to decide the final value of a neuron. The activation function makes the
linear combination of features non-linear so that the neural network can
learn faster and with high accuracy.

Figure 10.3 Typical architecture of CNN [35].
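The convolution, activation, and pooling steps described above can be
illustrated with a minimal pure-Python sketch. The toy image, the
edge-detecting kernel, the choice of ReLU as the activation function, and the
helper names (conv2d, relu, max_pool) are assumptions made for illustration;
they are not taken from [35].

```python
def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Activation layer: keep positive responses, set the rest to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, w - size + 1, size)
        ]
        for i in range(0, h - size + 1, size)
    ]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to a dark-to-bright step (left to right).
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)   # 4x4 map, each row is [0, 2, 0, 0]
activated = relu(feature_map)         # unchanged here, no negative responses
pooled = max_pool(activated)          # [[2, 0], [2, 0]]

# The mirrored kernel gives negative responses, which ReLU suppresses.
suppressed = relu(conv2d(image, [[1, -1], [1, -1]]))
```

The same kernel is applied at every location, so the single feature (here, a
vertical edge) is detected wherever it occurs in the image.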

Most CNN architectures use one or more fully connected layers before
the output layer, which is a typical MLP. All neurons in this layer are fully
connected to all activations in the previous layer. A typical CNN model
works by minimizing a loss function that is computationally feasible and
represents the price paid for inaccurate predictions.

The main reason for the popularity of the All-CNN model is that it achieves
high accuracy on different benchmarking datasets with a simple network
architecture for classification. This model differs from a standard CNN model
in two key aspects. First, it replaces the pooling layers with standard
convolutional layers with an increased stride. Second, it uses convolutional
layers with small kernel sizes, which greatly reduces the number of
parameters in the network and thus serves as a form of regularization.

NiN-CNN [37] uses a “micro-network” structure to approximate the nonlinear
functions in CNN, installing an MLP as the “micro-network” architecture. This
deep model leads to separation between latent features, which helps to
achieve better abstraction and accuracy. In a traditional CNN, the feature
maps of the last convolutional layer are flattened and passed on to one or
more fully connected layers, which are then passed on to softmax for the
classification.

In VD-CNN [37], the authors address the aspect of depth in CNN. In this
architecture, the authors increase the depth of the network by adding more
convolutional layers with stride 1. The authors argue that a stack of two
3 x 3 layers without any pooling layer in between has an effective receptive
field of 5 x 5, whereas three such layers have a 7 x 7 effective receptive
field, which has various advantages.
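The receptive-field argument can be checked with a few lines of arithmetic.
The sketch below is illustrative only: the helper names are hypothetical, the
channel width of 96 is borrowed from Table 10.1 purely as an example, and the
padding of 1 in the last call is an assumption that the chapter does not
state. It also shows one commonly cited advantage of stacking small kernels
(fewer weights than a single large kernel) and the output size obtained when
a stride-2 convolution is used in place of pooling, as in All-CNN.

```python
def receptive_field(layers):
    """Effective receptive field (one side, in pixels) of stacked layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # strides make later layers widen it faster
    return field


def conv_weights(kernel, channels, n_layers=1):
    """Weights (biases ignored) of n_layers kernel x kernel convolutions
    that each map `channels` input channels to `channels` output channels."""
    return n_layers * kernel * kernel * channels * channels


def output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


two_layers = receptive_field([(3, 1), (3, 1)])            # 5
three_layers = receptive_field([(3, 1), (3, 1), (3, 1)])  # 7
stacked = conv_weights(3, 96, n_layers=3)                 # 248832
single = conv_weights(7, 96)                              # 451584
# A 3x3 convolution with stride 2 (padding 1 assumed) halves a 32x32 input.
downsampled = output_size(32, kernel=3, stride=2, padding=1)  # 16
```

Under these assumptions, three stacked 3 x 3 layers cover the same 7 x 7
region as one 7 x 7 layer with roughly 45% fewer weights, while applying
three non-linearities instead of one.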

The EL deanship (ELD) at King Khalid University (KKU) was established in
the year 1426 H (as per the Arabic calendar) as part of the continuous online
learning in KKU. EL has tried to use the best techniques in developing
and improving its educational services. In a general context, preeminent
examples of using ML and AI are in the development process of LMS for
any educational system. Since their introduction, EL centers have been
conducting research and using trained IT staff to enhance education
systems, develop online learning skills, and apply the best online expertise
in imparting knowledge [38]. Currently, the EL deanship in KKU
has launched many new online services to achieve a new level of success in
the online education sector. KKU’s vision for EL is to take “KKU's human
resource at the highest skills and empower them to fulfil their changing needs
and aspirations through using embedded EL” [39].

To meet these objectives, EL in KKU has recently introduced Advanced
Google Classroom, Mediasite for online streaming of lectures, etc. EL has
also focused on advancing LMS for effective communication and feedback
beyond the traditional learning environment [40]. Many external applications
have also been developed with a focus on sharing educational content through
Google Classroom and other tools such as YouTube, Google Docs, Google Mail,
Task Manager, and Google Analytics [40]. According to the cited references,
these are the focal tools of the development programs of EL. These
applications have enabled all online users to share information at all levels
for learning and teaching (L&T) purposes [41].

History of EL: The expression “EL” is not very old; it was introduced in
1999, initially employed at a CBT [42] systems seminar. Other expressions,
such as “online learning” and “virtual learning,” are also used as synonyms
for EL [42]. EL has developed extensively over time and has acquired a much
deeper meaning in the 21st century [42]. Arthur Samuel was an American
pioneer in the field of computer gaming and AI, and, in 1959, while working
at IBM, he introduced the concept of ML [43]. In the 1960s, Nilsson authored
a book on learning machines, dealing mostly with ML for pattern
classification [43].

Figure 10.4 History of EL [42].


Interest in pattern recognition and its applications continued into the
1970s; in 1973, Duda and Hart described it in their book “Pattern
Classification” [44]. As a scientific endeavor, ML and AI have remained
very popular from past to present. Scholars, scientists, professionals, and
academicians all use them for processing information. They have approached
the problem with symbolic methods as well as with what were termed
“neural networks”; these were mostly perceptrons [45].
Another application of ML and AI is probabilistic reasoning [45], which
is mostly used in automated medical diagnosis by medical
practitioners [45].

ML and AI place emphasis on a methodology that supports the logical
processing of information. By 1980, probabilistic systems were beset by
theoretical and practical problems of data acquisition and representation
[46], and in the same year, expert systems had come to dominate AI [46].
Work on symbolic/knowledge-based learning continued within AI, leading to
inductive logic programming. However, the more statistical line of research
lay outside the field of AI proper, in pattern recognition and information
retrieval [46]. Neural network research had been abandoned by AI and
computer science (CS) around the same time during the 1980s. This line, too,
was continued outside the AI/CS field, as “connectionism,” by scientists
from other disciplines including Hopfield, Rumelhart, and Hinton.
Their main success came in the mid-1980s with the reinvention of
backpropagation [47], short for “backward propagation of errors,” an
algorithm for supervised learning of artificial neural networks using
gradient descent. ML, reorganized as a separate field, began to thrive
during the 1990s. The field changed its goal from achieving AI to tackling
solvable problems of a practical nature. ML shifted its focus away from the
symbolic approaches inherited from AI and toward methods and models borrowed
from statistics and “probability theory” [49]. ML likewise benefited from the
increasing availability of digitized data and the ability to distribute it
through the web. The past two decades have shown the drastic applications of
IoT in many
sectors; education is one of them, where IT professionals have developed
a new platform for learning and teaching using IoT. It is expected that, in
2022, IoT will take a great lead in secured online assessments for higher
education systems [49]. Research works in 2017 reported that 67%
of primary and secondary education systems in developed nations such as the
US, the UK, Germany, and Australia had already applied IoT in their teaching
systems and that this percentage was increasing sharply every year [49].
Research conducted in 2018 showed the concerns of educational systems about
security and privacy for IoT in education. IoT has caused cyber security
threats and aggravated network attacks. Also, studies conducted in 2019 show
that major attacks were planned by students to alter their grades and
attendance. IT professionals developed more advanced security systems
in 2019 to prevent such cyber-attacks in the academies and to revolutionize
the application of IoT in education. In 2017, IoT helped in
preventing distributed denial of service (DDoS) attacks that were designed to
intermittently bring down the University of Michigan’s computing network.
The year 2020 demonstrated the relevance of IoT in the education sector for
all countries because offline learning was suspended due to the spread of
COVID-19 [49]. Past research works have focused on single applications of
technologies in learning systems, whereas this chapter presents an example
from the higher education system showing how these technologies were
successfully implemented.


10.3 Discussion 


This chapter proposes a new CNN architecture that incorporates different
advantageous features from these three base models. Unlike the homogeneous
blocks used in the base models [50], we use heterogeneous blocks and keep the
network depth limited to achieve computational efficiency. This chapter
suggests applying a combination of the three CNN models; the specific
architectures of the combined model and the three base models are described
in Table 10.1.

Table 10.1 Specific architectures for the combined model and the three base models


All-CNN VD-CNN NiN-CNN Combined models 
Input 32x32 Grayscale Image 
3x3 conv. 96 ReLU 3x3 conv. 96 ReLU 5x5 conv. 192 ReLU 3x3 conv. 192 BatchNorm ReLU 
3x3 conv. 96 ReLU 3x3 conv. 96 ReLU 1x1 conv. 192 ReLU 1x1 conv. 192 ReLU 
3x3 conv. 96 ReLU 1x1 conv. 192 ReLU 
3x3 conv. 96 ReLU 3x3 max-pooling with stride 2 3x3 max-pooling 3x3 conv. 192 BatchNorm ReLU 
with stride 2 dropout (0.5) with stride 2 
dropout (0.5) 
3x3 conv. 192 ReLU 3x3 conv. 192 ReLU 3x3 conv. 96 ReLU 3x3 conv. 96 BatchNorm ReLU 
3x3 conv. 192 ReLU 3x3 conv. 192 ReLU 1x1 conv. 96 ReLU 1x1 conv. 96 ReLU 
3x3 conv. 192 ReLU 1x1 conv. 96 ReLU 1x1 conv. 96 ReLU 
3x3 conv. 192 ReLU 3x3 max-pooling with stride 2 3x3 max-pooling 3x3 max-pooling with stride 2 
with stride 2 dropout (0.5) dropout (0.5) 
3x3 conv. 192 ReLU 3x3 conv. 192 ReLU 3x3 conv. 32 ReLU 3x3 conv. 32 BatchNorm ReLU 
1x1 conv. 192 ReLU 1x1 conv. 192 ReLU 1x1 conv. 32 ReLU 3x3 conv. 32 BatchNorm ReLU 
1x1 conv. 32 ReLU 3x3 conv. 32 BatchNorm ReLU 
1x1 conv. 3 1x1 conv. 3 1x1 conv. 3 1x1 conv. 3 
global average pooling global average pooling global average pooling global average pooling 
3-way softmax 3-way softmax 3-way softmax 3-way softmax 
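All four models in Table 10.1 end with the same classification head: a 1x1
convolution with 3 output channels, global average pooling, and a 3-way
softmax. The following pure-Python sketch shows how such a head turns feature
maps into three class probabilities. The tiny 2-channel input, the weights,
and the function names are hypothetical values chosen for illustration; they
are not the trained parameters of any model in this chapter.

```python
import math


def conv1x1(feature_maps, weights, biases):
    """1x1 convolution: a per-pixel linear mix of the input channels.

    feature_maps: [C_in][H][W], weights: [C_out][C_in], biases: [C_out].
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            [
                biases[o] + sum(weights[o][c] * feature_maps[c][i][j]
                                for c in range(len(feature_maps)))
                for j in range(w)
            ]
            for i in range(h)
        ]
        for o in range(len(weights))
    ]


def global_average_pool(feature_maps):
    """Collapse each H x W feature map to its mean value."""
    return [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the last convolutional block: 2 channels of 2x2.
features = [[[1, 2], [3, 4]],
            [[0, 1], [0, 1]]]
# Hypothetical 1x1 kernel mapping 2 channels to the 3 output classes.
weights = [[1, 0], [0, 1], [1, 1]]
biases = [0, 0, 0]

class_maps = conv1x1(features, weights, biases)   # 3 maps of size 2x2
scores = global_average_pool(class_maps)          # one score per class
probabilities = softmax(scores)
predicted = probabilities.index(max(probabilities))
```

Because the head averages each class map instead of flattening it into fully
connected layers, it adds no parameters beyond the 1x1 kernel.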

For model training and testing, Figure 10.5 shows the relationship
between the loss function and epoch times for the training and validation sets
for All-CNN, VD-CNN, NiN-CNN, and the combined model.

The relationships of the loss function and of the accuracy to epoch times,
for the training and validation sets, are shown in Figure 10.6.


10.3.1 Definition of ML and AI


ML is not a separate branch; it has emerged from AI and includes algorithms
for making predictions and producing outcomes. The process identifies
patterns and then draws conclusions learned from the data. A parallel
process also runs in which all newly received information is first analyzed
and then used to predict the user's behavior. It is relevant to note here
that LMS [51] benefits from this process, as it provides a personalized
outlook for the users.

Figure 10.5 Relationship between the loss function and epoch times for the training and
validation sets for (a) All-CNN, (b) VD-CNN, (c) NiN-CNN, and (d) the combined model.


10.3.2 Definition of ML and AI in KKU EL


EL in KKU has also utilized ML and AI in LMS, where students' online access
to data is evaluated to calculate the duration of use and to determine which
tools students have used while working on LMS in EL.

In defining ML, two types of ML frameworks are identified: proprietary
and open-source [52]. Both are used for deep learning [52]. EL in KKU uses
proprietary deep learning software more than open-source software. EL in KKU
uses different tools developed by Google, such as tensor processing units
[53], while other academies can choose to implement proprietary or
open-source frameworks as per their educational requirements.


10.3.3 ML Classifications for KKU EL 


ML is expected to identify a user's data and give results in the form of
patterns. ML produces these patterns using algorithms to forecast outcomes.
EL in KKU has successfully implemented ML’s algorithm classification [53] in
LMS.

Figure 10.6 Relationship between the accuracy and epoch times for the training sets and
validation sets for All-CNN, VD-CNN, NiN-CNN, and the combined model.


Table 10.2 explains the classification of ML [54] into three categories:
supervised, unsupervised, and reinforcement [54]. EL in KKU is using all
three in the development of various online tools and techniques
in LMS services. The first category is used by information technology (IT)
specialists in developing a new interface for BB on LMS services, where the
IT human resources (HR) predict new datasets based on past datasets.
For the second category, EL in KKU is using a sub-category known as
“semi-supervised” [14], where IT HR tries to provide the system with an
accurate input and output relationship when building any new LMS platform.
To meet this objective, the EL deanship conducts extensive IT HR training.
In the third category, IT HR follows optimum methods of evaluating output
sets. This is an example of learning from rewards and errors.

Figure 10.7 ML as a subpart of AI [51].


Figure 10.8 ML framework [51]. 


Figure 10.9 ML classification [54].

Table 10.2 Definition of ML classification


Classification of ML [2, 11]

Supervised [2, 11]: This classification is based on trend analysis, where
the system takes instances from previously extracted and new data to give
the expected data. The organizational programmer has to provide both input
and output to the system for the perfection of the software. Repeated
practice makes the process of constructing targets and new datasets
autonomous.

Unsupervised [2, 11]: This classification has no fixed rules. The system
analyzes the datasets to find patterns and draw some conclusions. It is very
helpful in the case of ambiguous data; however, it is not concerned with the
alignment of input and output.

Reinforcement [2, 11]: As the name suggests, it focuses on fixed objectives
that a system has to reach. This classification is based on reviews and
feedback to achieve the desired milestone. It is like learning from mistakes
and from what proves effective.
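The three categories in Table 10.2 can be contrasted with a toy sketch. The
algorithms used here (nearest neighbour, two-means clustering, and an
epsilon-greedy bandit) are standard textbook stand-ins chosen for brevity,
not the methods used by EL in KKU, and the data (weekly hours of LMS
activity, course levels, reward probabilities) are entirely hypothetical.

```python
import random


def nearest_neighbour(train, x):
    """Supervised: labelled (input, output) pairs are given in advance."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def two_means(values, iterations=10):
    """Unsupervised: group unlabelled 1-D values around two centres."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            closer_to_high = abs(v - high) < abs(v - low)
            groups[1 if closer_to_high else 0].append(v)
        low = sum(groups[0]) / len(groups[0])
        high = sum(groups[1]) / len(groups[1])
    return sorted(groups[0]), sorted(groups[1])


def epsilon_greedy(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Reinforcement: learn action values from reward feedback alone."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:      # explore a random action
            action = rng.randrange(len(reward_probs))
        else:                           # exploit the best estimate so far
            action = max(range(len(values)), key=values.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values


# Supervised: hours of activity labelled with a recommended course level.
labelled = [(1, "basic"), (2, "basic"), (8, "advanced"), (9, "advanced")]
level = nearest_neighbour(labelled, 7)

# Unsupervised: the same kind of data without labels.
light_users, heavy_users = two_means([1, 2, 3, 10, 11, 12])

# Reinforcement: two hypothetical content formats with unknown success rates.
estimates = epsilon_greedy([0.2, 0.8])
```

In the supervised case the programmer supplies both input and output; in the
unsupervised case only the inputs are available; in the reinforcement case
the system sees neither and improves only through the rewards it receives.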


10.3.4 The Benefits of ML and AI in KKU EL [55]


EL in KKU is making the most of the benefits provided by ML and AI for its
online learners and instructors. The focus is on the present and future of
LMS practices, mostly for new LMS [55], where the idea is to provide
predictive algorithms and automated delivery of EL contents. Figure 10.10
gives the benefits of ML and AI in KKU EL in general.


[Figure 10.10 labels: More Personalized EL Content; Better Resource
Allocation; Automate Scheduling and Content Delivery Process; Improve EL ROI;
Improve Learner Motivation; Create More Effective Online Training Programs]

Figure 10.10 Benefits of ML and AI in EL [55].

Table 10.3 Benefits of ML and AI in EL [55]


Benefit of ML and AI in EL KKU

Customized EL Content: EL in KKU is using ML algorithms to predict results,
which allows it to provide specific EL content based on past performance and
individual learning goals. If the online user is a more active and frequent
user of any tool on LMS and BB, the systems automatically provide the
recommendation to the users. EL is also using ML and AI to show the gap and
excellence in learning skills presented in its users. ML and AI help give
the KKU online users more customized learning materials. The system
recommends more integrated, basic, or complex courses to the online learners
based on their online behavior.

Resource Allocation: ML and AI offer two aids in resource allocation, one
for the educational sector and one for corporate. EL in KKU benefits from
the educational aspect, where online learners on LMS get the accurate online
resources they require to fill gaps and achieve their learning goals. This
builds the bridge for their skills and facilitates meeting learning
outcomes. The second benefit of ML and AI in resource allocation that EL in
KKU identified is for the HR team. Now, they are taking minimum time in
analysis and developing powerful contents for LMS.

Automate Content Delivery and Scheduling Process: Developing tools on LMS is
not an easy task, but with ML and AI, such complex, ambiguous, and
time-consuming tasks become easier. The EL deanship in KKU schedules the
coursework for online learners and delivers online resources. This technique
depends on their EL assessment results or simulation presentation. EL in KKU
expects that AI in coming days will make this process automatic and create a
niche in EL modules for all online learners who are participants on LMS.

KKU EL Return on Investment: EL in KKU witnessed a better return on
investment with ML and AI, resulting in a more customized approach and
better revenue earnings. With the help of predictive analysis, online
learners take online training promptly, which saves time for other
endeavors. Apart from this, EL in KKU uses AI-equipped software that can
monitor and predict online users' behavior on LMS.

Improve Learner Motivation: The EL deanship believes that ML systems and AI
in future can be likened to a private virtual teacher. Many online learning
programs have been developed by the IT HR following the concepts of ML and
AI. For example, KKU X under the EL deanship used ML and AI in establishing
distinguished contents for young Saudi students to develop skills to prepare
the graduate for potential employment.

Online Training Programs: ML and AI are implemented by EL in KKU in making
peer-to-peer communication productive. Also, AI created a mapping between
electronic experts and learning on the LMS platform in KKU.


Researchers have discussed how EL in KKU is benefitting from ML and AI in
its LMS practices. Table 10.3 gives a brief description of the benefits
received from ML and AI in LMS and other online platforms.


10.3.5 ML and AI are Transforming the EL Scenario in KKU


Consider a scenario where an online user can create EL content and
afterward let the system deal with the more tedious tasks, for example,
analyzing charts and statistics to recognize hidden patterns. KKU has
envisioned a scenario where IT HR provides customized EL feedback and steers
online students in the right direction automatically, with no human
intervention. ML and AI can automate the behind-the-scenes work that
requires a lot of time and resources [55], and that is how EL in KKU is
benefitting from the transformation; finally, AI assists the IT HR in
detecting relevant errors in the LMS environment. ML and AI also help the
IT HR of the EL deanship in customizing online users’ learning choices based
on their previous usage, performance, and work requirements. Figure 10.11
lists the applications of ML and AI applied in EL in KKU: research available
tech tools (RTT), current big data collection (BDC), ML’s role in online
training strategy (MLOTS), and future game plan for online learning (FGOL).

[Figure 10.11 labels: Research Available Tech Tools; Current Big Data
Collection; ML’s Role in Online Training Strategy; Future Game Plan for
Online Learning]

Figure 10.11 ML and AI application of EL in KKU [56].


All the applications of ML and AI are explained in the context of EL in
KKU, and other institutions can work on similar plans too. Table 10.4 gives
a brief account of the applications.

IT HR in EL in KKU conducted research and selected LMS and EL technological
tools with the latest integration of ML and AI. For instance, there are many
EL applications [56] with algorithms and automation features built in
advance, and EL in KKU is benefitting from them. As a result, these
applications have created an efficient platform for LMS practices [57] in
KKU.

IT HR in the EL deanship is well aware that data cannot be utilized
completely unless ML and AI give a factual report. IT HR is assembling a
significant amount of data with the help of ML and AI for EL in KKU;
however, some of the data is either not utilized or not relevant for any
digital training on LMS in KKU. The reason is that no good methods can
explain the relevance of the data applied while integrating it into
algorithms and predictive analytics. Also, IT HR realized that ML needs to
capture the entire picture and not just a snapshot of some short duration.
Therefore, they integrate and arrange data received from LMS and other
online resources, such as social networks and web sources, to elaborate the
pattern.

EL in KKU has understood the advantages of ML in describing the online
learning and teaching (L&T) strategies and has, therefore, applied it in the
LMS platform; however, the limitations of complete dependence on ML and AI
have not been overlooked by EL in KKU. Therefore, the EL deanship developed
and applied a combination of big data and in-person contribution to develop
online L&T strategies for LMS. EL in KKU has a vision for ML and AI, and all
the IT HR are working on making online L&T strategies effective and on how
this will be achieved through the use of ML and AI.

Table 10.4 ML and AI applications of EL in KKU


RTT [58]

IT HR in EL in KKU conducted research and selected LMS and EL technological
tools with the latest integration of ML and AI. For instance, there are many
EL applications [10] with algorithms and automation features built in
advance, and EL in KKU is benefitting from them.

As a result, these applications have created an efficient platform for LMS
practices [11] in KKU.


BDC [58]

IT HR in the EL deanship is well aware that data cannot be utilized
completely unless ML and AI give a factual report. IT HR is assembling a
significant amount of data with the help of ML and AI for EL in KKU;
however, some of the data is either not utilized or not relevant for any
digital training on LMS in KKU. The reason is that no good methods can
explain the relevance of the data applied while integrating it into
algorithms and predictive analytics. Also, IT HR realized that ML needs to
capture the entire picture and not just a snapshot of some short duration.
Therefore, they integrate and arrange data received from LMS and other
online resources, such as social networks and web sources, to elaborate the
pattern. This led IT HR in the EL deanship to provide trends for online
training for LMS users in KKU.


MLOTS [58]

EL in KKU has understood the advantages of ML in describing the online
learning and teaching (L&T) strategies and has, therefore, applied it in the
LMS platform. However, the limitations of complete dependence on ML and AI
have not been overlooked by EL in KKU. Therefore, the EL deanship developed
and applied a combination of big data and in-person contribution to develop
online L&T strategies for LMS. EL in KKU has a vision for ML and AI; all the
IT HR are working on making online L&T strategies effective and on how this
will be achieved through the use of ML and AI. These strategies are being
checked through trend analysis. Until EL in KKU becomes confident, it has
decided not to drop the in-person contribution and make the entire system
automatic.


FGOL [58]

Based on the working of MLOTS, EL in KKU cannot give a fixed date for the
complete application of ML and AI for its LMS and other online services such
as online training for L&T. Rather, IT HR is focusing on a tentative plan
and applying ML and AI for an effective outcome for each activity on LMS and
other services.


10.3.6 Customized EL Content [57]


EL in KKU is using ML algorithms to predict results, which allows it to
provide specific EL content based on past performance and individual
learning goals. If the online user is a more active and frequent user of any
tool on LMS and BB, the systems automatically provide the recommendation to
the users [57], and EL is using ML and AI to show the gaps and excellence in
the learning skills of its users. ML and AI benefit the KKU online users by
providing more customized learning materials. The system recommends
more integrated, basic, or complex courses to the online learners based on
their online behavior.


10.3.7 Resource Allocation [57] 


ML and AI offer two aids in resource allocation: one for the educational
sector and one for corporate [6]. EL in KKU benefits from the educational
aspect, where online learners on LMS get the accurate online resources. This
builds the bridge for their skills and facilitates meeting learning outcomes.
They need these resources to fill gaps and achieve their learning goals
[57]. The second benefit of ML and AI in resource allocation is for the HR
teams in EL in KKU. Now, they are taking minimum time in
analysis and developing powerful contents for LMS [57].


10.3.8 Automate Content Delivery and Scheduling Process [57] 


Developing tools on LMS is not an easy task, but with ML and
AI, such complex, ambiguous, and time-consuming tasks become easier. The EL
deanship in KKU schedules the coursework for online learners and delivers
online resources. This technique depends on their EL assessment results or
simulation presentation. EL in KKU expects that AI will soon make
this process automatic and create a niche in EL modules [58] for all online
learners who are participants on LMS.


10.3.9 Improve KKU EL Return on Investment [58] 


EL in KKU witnessed a better return on investment with ML and AI.
This has resulted in a more customized approach and better revenue
earnings. With the help of predictive analysis, online learners take online
training promptly, which saves time for other endeavors. Apart
from this, EL in KKU uses AI-equipped software that can monitor and predict
online users’ behavior on LMS.


10.3.10 Improve Learner Motivation [58]


The EL deanship believes that ML systems and AI can, in future, be likened
to a private virtual teacher. Many online learning programs have been
developed by the IT HR following the concepts of ML and AI. For example,
KKU X under the EL deanship used ML and AI in establishing distinguished
contents for young Saudi students to develop skills that prepare graduates
for potential employment [58].


10.3.11 Online Training Programs [58] 


ML and AI are implemented by EL in KKU in making peer-to-peer
communication productive. Also, AI has created a mapping between electronic
experts and learning on the LMS platform in KKU.

EL in KKU has assumed that ML and AI will certainly contribute noticeably
to the future growth of EL. These applications have given various
advantages, such as connecting individual online students to
a group of other online users and, finally, to wider remote associations.
Researchers have identified limitations in the scope of ML and AI. Results
have not yet adequately demonstrated the range of advantages and the
capacity of ML and AI assistance.

KKU has applied IoT solutions for enhancing security and preventing
cyber-attacks. The academy has a strong cybersecurity unit that
applies IoT to prevent all types of network attacks and threats to the
system. The university is using IoT for effective learning and for
redefining the roles of all users such as students, teachers, and
administrators. Now, these users can interact and connect to technology and
devices in classroom environments, which aids learning experiences, improves
educational outcomes, and reduces costs. Some examples of IoT solutions used
by KKU are given in Figure 10.12.

There are many benefits for which IoT can be used, and KKU is using
some of them very effectively. Other benefits of IoT in education, which are
used by other academies in developed nations, are given in Figure 10.13.

There are many benefits of IoT that educational sectors have identified;
however, they also face major challenges in its application. IoT is
the reason for huge data flows that have helped in increasing performance at
the operational as well as the management level but, at the same time, have
raised security issues too. To solve these issues, universities and schools
have to expand their infrastructure to manage security at the network level
[59]. Many academies including KKU have adopted a strategy of using
traditional network designs to offer innovative methods of solving security
problems and introducing a new hierarchy of network intelligence,
automation, and security. With this objective, the university has developed
a simple, automated process for IoT device onboarding and avoided applying
large IoT (LIoT) systems [59]. This option helped the academy because LIoT
systems can have a large number of sensors or mechanisms that can cause
management issues and more complex errors. The simpler process can provide
a secure environment against cyber-attacks and data loss. KKU has
implemented security at multiple levels, including control of the IoT
networks. For protecting IoT traffic and devices in the academy, KKU has its
own strategic approach, which takes advantage of multiple security
safeguards instead of a single security technology. KKU introduced
high-quality network connectivity for its users, which in return assures
better working and secured data delivery [59]. Therefore, KKU applied IoT
systems in its branches and colleges to implement smart planning, smart
learning systems, and smart design in online as well as face-to-face
learning.

Figure 10.12 IoT in learning systems in KKU.

Figure 10.13 IoT benefits in education [59].


10.4 Results 


The results for the first part shows the averages of the recognition accuracy in 
test datasets for the three-level SEt and the combined CNN model achieved 
higher accuracy than All-CNN, VD-CNN, and NiN-CNN models. Reliable 
models that can recognize learners’ SEt during a learning session, particularly 
in contexts where there is no instructor present, play a key role in allowing 
learning systems to intelligently adapt to facilitate the learner in online 
learning. 

This research chapter is based on a complete qualitative approach where
IT HR and other online users, such as online instructors, online students,
and EL administrators, were given a close-ended questionnaire. The researchers
based the conclusions of this report on the answers given by the respondents.
The close-ended questions cover four domains, namely ExL, EER, OnT, and AC
[61], which are given in Figure 10.14. Questions on AI and ML determining EL
in KKU are listed in Table 10.5.

Once the Internet was commercialized, L&T took great advantage of it.
EL in KKU has acknowledged that today's activities for EL and online education
will become the standards of academic instruction tomorrow [62], and
that ML and AI will shape its pedagogy.


Figure 10.14 ML and AI transforming EL in KKU (recovered diagram labels: Online tutoring [OnT], Academic connectivity [AC]).


Table 10.5 Questionnaire: ML and AI transforming EL in KKU

Response scale: Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree.

Domain | Questions | Response counts
ExL | EL provides LMS as first development in online L&T | 45
ExL | EL in KKU used various applications in student’s learning on LMS and BB | 44
ExL | LMS BB is easy to use for all online users | 46
ExL | LMS BB helped online instructors to achieve CLOs and build curriculum as focusing learners | 46
EER | EL in KKU provides online learning database | 46
EER | EL gives access to Saudi digital library for good learning resources | 46
EER | EL has developed full online course to support massive needs of the KKU learners’ massively online open courses (MOOCs) [21] | 46
EER | EL has developed full online course to train KKU online experts under massively online open courses (MOOCs) [21] | 46
EER | EL has introduced more online learning environment such as Google class room KKUX [29] | 46
OnT | EL has used some video conferencing options such as BB collaborate [21] | 48
OnT | EL online communication works effectively for L&T purpose for all online users | 45
OnT | EL has provided a variety of L&T and assessment tools on BB, which are effective for L&T purpose | 46, 4
AC | EL provides platform for online research in education sectors by providing access to learning resources and software | 43

Determinants | Questions | Response counts
AG&M | EL on LMS BB is using automated grading methods for some online assessments through grade center [21] | 2, 48
AG&M | Online grade center tool on LMS facilitates in analyzing the students’ performance | 48
AG&M | Online instructors use LMS BB and other tools of EL to check student’s participation and monitor their educational growth | 48
CC | LMS BB and other tools of EL are used to develop online learning materials, curriculum, and design | 10, 35
CSL | EL in KKU provides crowd-sourced knowledge allocation and association on its LMS BB | 39
SLS | EL provides tools to monitor the methods; students use and transform LMS BB services | 10, 35
SLS | LMS BB and other tools of EL provide communication between online users and build a liaison between different participants in L&T | 40

Figure 10.15 Domain of AI and ML in EL for KKU.


Respondents were asked to rate the questions from strongly agree to
strongly disagree on a five-point scale across the four domains shown in
Figure 10.15. The questions given in Table 10.5 are close-ended inquiries
on how ML and AI are transforming EL in KKU. In total, 50 respondents
(online users) were given this questionnaire.

The researchers did not inform respondents about the objective of the
research because respondents might not know the technical aspects of ML
and AI; they were simply asked to rate their experiences with the various tools
of EL in KKU. The many abbreviations used in the questionnaire
were explained in its notes for better understanding.


10.4.1 ExL 


For analyzing ExL, four questions were asked pertaining to EL tools and
techniques. Most of the respondents agreed on their effectiveness and believed
that EL provides pioneering online services through LMS BB in
KKU. IT HR added that ML and AI played a vital role in developing the
platform in the EL deanship.


10.4.2 EER 


More than 45% of the respondents agreed that EL tools and techniques 
have provided an excellent online interface for the fully online learning 
environment. ML and AI aided in developing massively online open courses 
(MOOCs) and other online learning programs for EL in KKU such as BB
ultra.


10.4.3 OnT


EL in KKU has found that EL cannot work effectively without real-time
communication. More than 46% of the respondents agreed on the effectiveness
of BB Collaborate on the LMS. ML and AI are used by IT HR to
develop video conferencing apps for LMS services.


10.4.4 AC 


It is assumed that respondents do not have a clear notion of the
online research advantages available through EL tools and techniques; therefore,
the results show a neutral conclusion. However, IT HR in EL explained various
methods and applications developed in EL KKU for online research,
such as providing access to an online digital library, strategic alliances that
open up other universities’ learning resources, and a liaison with the research
and development site at KKU. IT HR also described the use of algorithms
and natural language processing as evidence of the technological advancement
of EL in KKU over ten years with the help of ML and AI [62]. Figure 10.16
identifies the determinants of ML and AI in EL in KKU.

Based on these determinants, a close-ended report was collected that
defined the role of ML and AI in the EL deanship in KKU more precisely.
Figure 10.17 shows the determinants of ML and AI for EL in KKU.


10.4.5 AG and M 


Respondents are highly satisfied with this determinant, as the results show an
absolute percentage for it. IT HR is utilizing AI language developers using
algorithms and ML builds to check assessment tools [62]. EL in KKU
applied an optical recognition technique for the grade center tool on LMS BB. ML
and AI have helped online instructors in two ways [62]: developing a congenial
learning environment in general and developing customized learning packages to
enhance course learning outcomes (CLOs).

Figure 10.16 AI and ML determining EL in KKU.

Figure 10.17 Determinants of AI and ML in EL in KKU.


10.4.6 CC 


EL in KKU has realized the importance of deep learning (DL) [63], which is
a more extensive type of ML; the idea of DL is to develop smart applications
for service industries such as finance, legal services, and education [64]. These
ideas are to be applied in developing learning material, creating attractive
designs for learning modules, and developing an inclusive and integrated
curriculum [64]. The overall effect shows neutral results, but IT HR elaborated
on the implications for developing exceptional online modules.


10.4.7 CSL 


Respondents show results similar to those for CC: they
could not identify how EL in KKU deals with information
reservoirs such as the information available on social networking sites or
Wikipedia [64]. IT HR in EL in KKU does realize the significance of gathering
information and creating a pool of information for online users in KKU. With
ML and AI, they have developed an FAQ information pool that handles
routine online queries [64].


10.4.8 SLS 


In this section, the results show some variation: respondents were neutral
about the methods used to transform LMS BB services, whereas they fairly
agreed about online communication between all users and the sharing of other
online services. EL in KKU explained that ML and AI technologies
have facilitated all groups of users in the EL deanship along with the strategic
participants. ML and AI contributed to building the online educational
infrastructure for EL in KKU.

IoT has aided KKU in building its learning strategies; it has made
students more involved in learning procedures and helped them retain what
they learn for future applications in their professional and educational
advancement. KKU has been applying IoT in many ways, such as in collaborative
studies, review processes, smart surveys and their analysis, and smart quality
and accreditation processes for national and international societies.


10.5 Conclusion 


CNN and a combined model for the students’ SEt classification show high 
accuracy. In the experiments, three-level (not-engaged, normally engaged, 
and highly engaged) decisions on SEt detection have been made, where the 
combined model shows high accuracy in SEt classification. The SEt detection 
method can help improve learners’ learning experiences in different online 
learning activities, such as reading, writing, watching video tutorials, and 
participating in online meetings. 

ML and AI have facilitated the development of tools, techniques, and
online services of EL in KKU. These services include LMS BB, BB
Collaborate, KKUX, etc. IT HR of EL in KKU recognized that ML
and AI have helped in developing, designing, and implementing tools and
techniques helpful to educators and learners. EL in KKU also looks forward
to applying ML algorithms and AI to predict potential advantages in
the development of its services, with the aim of meeting CLOs. In future, IT HR
also plans to develop interoperability with other KKU digital services, such as
the students’ registration site and faculty self-services, to automate features
such as attendance, grades, and faculty profiles.

This short study summarizes the growth and benefits of IoT in education
for good communication and smart learning through the implementation of various
IoT applications. KKU has also realized the importance of IoT and therefore
applied many IoT applications for meeting course learning outcomes and
monitoring students’ success and growth in academics.


Abstract


Texture analysis plays an important role in computer vision in that it is 
critical to both the characterization and segmentation of regions in images. 
Its application is wide-ranging in different technical disciplines, such as the 
food industry, materials characterization, remote sensing, and medical image 
analysis. Over the last decade, deep learning has redefined research frontiers 
in image recognition, including texture analysis. Most of the current advances 
are driven by transfer learning with convolutional neural networks, which do 
not require large volumes of data to develop and deploy models. 

In this chapter, comparative analyses of textures based on the use of 
transfer learning with different neural network architectures and traditional 
approaches are presented via three case studies. In the first, Voronoi simula- 
tion of material textures is discussed and the ability of convolutional neural 
networks to discriminate between different textures is considered. Textures 
like these play a critical role in the geometallurgical and quantitative structure 
property relationship modeling in material science. In the second case study, 
it is shown that transfer learning can significantly improve froth imaging 
sensors to expedite advanced real-time control of mineral flotation plants. 

In the final case study, textures associated with historic gold price data
are considered and it is shown that these methods can be used to visualize
changes in the stock price data or any other signals more generically.

Keywords: Convolutional neural networks, texture analysis, transfer learning,
froth images, Voronoi texture, gold price.


11.1 Introduction to Transfer Learning with Convolutional 
Neural Networks 


Artificial neural networks have a long history rooted in the early work of 
pioneers such as McCulloch and Pitts in the 1940s. As a family of machine 
learning models, they form the backbone of what is referred to as connection- 
ist artificial intelligence, as opposed to symbolic artificial intelligence that is 
associated with expert systems and logic formalisms. 

While experiencing successive periods of growth and stagnation, artificial 
neural networks have diversified from perceptrons to kernel-based architec- 
tures, self-organizing architectures, autoencoders, as well as deep learning 
architectures, including convolutional neural networks (CNNs). 

Although CNNs trace their history back to the 1980s [1-3], it is only
recently that their application in industry has seen explosive growth. Ab
initio development of CNNs requires a) extensive computational resources,
and b) large labeled data sets. Of these, the second requirement can be a
critical hindrance, as such data may more often than not be unavailable or
very costly to acquire. As a consequence, the wide acceptance of CNNs in
industry has been and continues to be driven by what is known as transfer
learning.

Transfer learning [4, 5] is achieved when learning or knowledge acquired 
in one domain, referred to as the source, is transferred to another, referred to 
as the target. For example, a CNN trained on a large labeled data set is applied 
to a different data set that is related to the training data. This is particularly 
important in image analysis, where CNNs that have been trained on the large 
ImageNet data base as source can be applied directly to other tasks in image 
analysis in a target domain or can be retrained with comparatively small data 
sets and little retraining from this domain. 

As discussed by Pan and Yang [4], different transfer learning approaches 
can be followed, based on the characteristics of the source and target data 
and domains, as summarized in Figure 11.1. The most popular approach is 
inductive transfer learning, where labeled data are available in both the source 
and target domains, as indicated by the solid red line in Figure 11.1. This 
is also the approach that was followed in the case studies presented in this 
chapter.

Figure 11.1 Transfer learning approaches.


11.1.1 The ImageNet Large-Scale Visual Recognition Challenge 
(ILSVRC) 


ILSVRC is a challenge designed for large-scale object recognition,
based on the ImageNet data set, and it has been running since 2010 [6]. ImageNet
contains approximately 15,000,000 labeled images; the challenge uses a subset
covering 1000 common object categories, and competing algorithms are required
to identify these objects. ImageNet is the most common data base used in
transfer learning, and the winning networks of the competition are, among
others, publicly available for transfer learning applications.


11.1.2 Transfer Learning Strategies 


Different transfer learning strategies can be employed, as determined by the 
availability of computational resources and data from the application domain. 
For example, if very few or no data are available, transfer learning can be 
used directly, without any further training. Examples of this would be the 
extraction of features from single images. Under these circumstances, the 
images would simply be passed through the CNN and the features would 
be extracted from any of the various feature layers in the network (usually 
the last). If a small labeled data set is available, it may be possible to use 
these data to further train some of the last feature layers in the network, 
after adapting the dense layer of the network to the application at hand, 
whether that is classification or regression. If a larger data set is available, 
extensive retraining or further training of the network may be possible. This 
is illustrated in the left panel in Figure 11.2. 
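
The three strategies described above can be sketched as a choice of which layers are left trainable. The layer names, the `plan_training` helper, and the default of two trailing feature layers are hypothetical and chosen only for illustration; in a deep learning framework the same idea is usually expressed by freezing or unfreezing layer parameters.

```python
# Hypothetical layer names for a small CNN: feature layers followed by a dense head.
FEATURE_LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5"]
DENSE_LAYERS = ["fc"]


def plan_training(strategy, n_last_feature_layers=2):
    """Map each layer name to True (trainable) or False (frozen)."""
    if strategy == "direct":
        # Very few or no target data: use the pretrained network as is.
        trainable = set()
    elif strategy == "partial":
        # Small labeled data set: train the dense layer and the last feature layers.
        start = max(0, len(FEATURE_LAYERS) - n_last_feature_layers)
        trainable = set(DENSE_LAYERS) | set(FEATURE_LAYERS[start:])
    elif strategy == "full":
        # Larger data set: retrain or further train the whole network.
        trainable = set(FEATURE_LAYERS) | set(DENSE_LAYERS)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {name: name in trainable for name in FEATURE_LAYERS + DENSE_LAYERS}


for strategy in ("direct", "partial", "full"):
    plan = plan_training(strategy)
    print(strategy, [name for name, is_trainable in plan.items() if is_trainable])
```

The sketch only records the training plan; how many feature layers to unfreeze in the intermediate case is a tuning decision that depends on the amount of target data.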

As indicated in the right panel in Figure 11.2, deep neural networks
trained by means of transfer learning can outperform the same networks
trained from an initial position with randomized weights, particularly if
relatively few data are available for training.

Figure 11.2 Transfer learning strategies for CNNs (left) and comparative performance with
ab initio deep learning of the same models (right).


11.2 Texture Analysis 


Texture in images is not formally defined, but it can be seen as
repeating patterns of local variation in the intensities of the image pixels.
Texture analysis is aimed at quantifying these variations in
intensities and patterns.


11.2.1 Textures in Nature and the Built Environment 


Textural patterns occur very widely in nature and the built environment,
of which just a few examples are shown in Figure 11.3. Broadly speaking, the
primary goals of image texture analysis are four-fold, focusing on texture
categorization, segmentation, synthesis, and shape. The classification of
texture is used to find regions in images that have different textures; texture
segmentation focuses on the identification of boundaries between such
regions. Texture synthesis is a relatively recent development associated with
algorithms that can generate realistic textures similar to real ones. Finally,
identification of texture can also facilitate the recovery of three-dimensional
shapes of objects in images.

For example, the derivation of quantitative descriptors for ore textures has
grown rapidly in tandem with developments in analytical instruments, such as
QEMSCAN, MLA [7, 8], and X-ray computed tomography [9, 10].

Figure 11.3 Examples of natural textures.


Similarly, quantification of texture can be used in failure diagnostic 
models of metallic structure [11-14]. 

Shark skin textures serve as a biomimetic template for low-friction materials
[15, 16] used in, among others, athletic swimwear, non-toxic biofouling
control [17, 18], and marine vessels [19]. In the meat industry, quantification
of meat texture has attracted considerable interest recently [20-22]. These are 
just a few examples of the many diverse applications that have been reported 
to date. Textures in the built environment include textures on different scales. 
On a microscopic scale, this could be associated with the crystallographic 
orientations in scanning electron micrographs. On a macroscale, these could 
be the surface characteristics of materials captured photographically, such as 
the surface finish of a steel plate in a cold rolled state during production. 


11.2.2 Traditional Approaches to Texture Analysis 


Traditionally, four main classes of approaches are recognized in texture
analysis, viz. statistical, spectral, structural, and model-based approaches, as
indicated in Table 11.1. These and other methods are considered in detail by,
among others, Ghalati et al. [23] and are only briefly discussed here.


11.2.2.1 Statistical methods 

Statistical methods focus on analysis of the spatial distributions of the values 
of gray levels in the neighborhoods of pixels. First-order (pixel value frequen- 
cies), second-order (pair-wise relationships of pixel intensities), or higher 
order (relationships between pixel intensities beyond pixel pairs) methods 
can be defined.

Generally, second-order statistical methods, such as GLCMs, are more
powerful discriminators than first-order methods. LBPs are examples of the
latter, combining the occurrence of pixel intensity values with local spatial
structures [23].
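
As a small illustration of why first-order statistics alone can be weak discriminators, the sketch below computes three histogram-based features for two tiny images that have the same gray level frequencies but different spatial arrangements. The function name and the choice of mean, variance, and entropy as features are assumptions made for this example, not the feature set of any cited method.

```python
import math
from collections import Counter


def first_order_features(image):
    """Mean, variance, and Shannon entropy (bits) of the gray level histogram."""
    pixels = [value for row in image for value in row]
    n = len(pixels)
    probabilities = {level: count / n for level, count in Counter(pixels).items()}
    mean = sum(level * p for level, p in probabilities.items())
    variance = sum((level - mean) ** 2 * p for level, p in probabilities.items())
    entropy = -sum(p * math.log2(p) for p in probabilities.values())
    return {"mean": mean, "variance": variance, "entropy": entropy}


# Same pixel value frequencies, different spatial structure.
checkerboard = [[0, 1], [1, 0]]
stripes = [[0, 0], [1, 1]]

print(first_order_features(checkerboard))
print(first_order_features(stripes))
```

Both images yield identical first-order features because the histogram ignores where the pixels are located; second-order methods that look at pixel pairs can separate them.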


Table 11.1 A taxonomy of texture analytical methods

Class | Approach | Method | References
Traditional | Statistical | First-order statistics | Bonnin et al. [24]
Traditional | Statistical | GLCM, 3D-GLCM | Haralick et al. [25]; Kim et al. [26]
Traditional | Statistical | LBP | Ojala et al. [27]; Ojala et al. [28]; Ahonen et al. [29]
Traditional | Statistical | LDP | Shabat and Tapamo [30]
Traditional | Statistical | LBP-TOP | Nanni et al. [31]; Fu and Aldrich [32]
Traditional | Statistical | LDP-TOP | Bonomi et al. [33]; Arita et al. [34]
Traditional | Statistical | LTP | Nanni et al. [31]
Traditional | Statistical | LTP-TOP | Nanni et al. [31]
Traditional | Statistical | NGLDM | Arita et al. [34]
Traditional | Statistical | SGLDM | Moolman et al. [35, 64, 65]
Traditional | Structural | Morphological operations | Soille [36]
Traditional | Structural | Primitive measurement | Jing and Shan [37]
Traditional | Structural | Skeleton representation | Wang and Na [38]
Traditional | Spectral | Wavelets | Unser [39]
Traditional | Spectral | Laws filters | Laws [40, 41]
Traditional | Spectral | Gabor transforms | Jain and Farokhnia [42]; Kim et al. [26]
Traditional | Model-based | Random fields | Cross and Jain [43]; Yang and Liu [44]
Traditional | Model-based | Fractals | Pentland [45]
Traditional | Model-based | Autoregressive | Kashyap and Khotanzed [46]
Traditional | Model-based | Texems | Xie and Mirmehdi [47]
Recent | Graph-based | | Li et al. [48]; Bashier et al. [49]; Gaetano et al. [50]
Recent | Entropy | | Jernigan and D’Astous [51]; Silva et al. [52]
Learning | Vocabulary | | Varma and Zisserman [53]
Learning | Deep learning (CNNs) | | Bastidas-Rodriguez et al. [14, 54]


11.2.2.2 Structural methods

Structural methods are based on the premise that texture consists of sets of 
texture elements or primitives that can have a regular or irregular arrange- 
ment. Structural elements are designed to identify these primitives (e.g., 
blobs, line segments, or regions characterized by uniform gray level values), 
as well as inferring their arrangement in the image. 


11.2.2.3 Spectral methods 

Spectral methods are also referred to as transform-based or filter-based 
methods. These methods can be used to analyze the frequency components 
of texture in the spatial domain, frequency domain, or both. Respective 
examples of these are Laws filters [40, 41], Fourier transforms [55], and Gabor
filters [42]. With these approaches, the frequency content is represented by a 
filter response set that is obtained through convolution of the image with the 
filters in question, the statistics of which comprise the textural features of the 
images. 


11.2.2.4 Modeling approaches 

With model-based methods, textural features are represented by the parame- 
ters of models. Identification of the correct models and mapping of textural 
features onto these models are critical to the success of these approaches, 
examples of which are indicated in Table 11.1. 


11.2.3 More Recent Approaches to Texture Analysis 


More recent approaches in the analysis of textures include graph-based and 
entropy methods. With graph-based methods, graphs consisting of vertices 
(nodes) and edges (connections) are defined over image textures, after which 
features associated with the graphs can be extracted. 

Entropy-based methods are derived from information theory and are 
established approaches in time series analysis, where they provide a means to 
quantify the complexity of the time series. They are less well established in 
image analysis. 


11.2.4 Learning Approaches to Texture Analysis 


11.2.4.1 Vocabulary-based approaches 

Vocabulary learning is a bag-of-words method [56], designed to learn a
dictionary containing textural elements computed by local descriptors. Common
local descriptors are Leung-Malik filters [56], maximum response filters [53],
scale-invariant feature transforms [57], rotation-invariant feature transforms
[58], and local binary patterns [27].

Extracted local descriptors are clustered to construct the dictionary, fol- 
lowed by the encoding of features and pooling into a global descriptor. 
Feature encoding is a key element of vocabulary-based learning methods. In 
the generation of texton features for example, a histogram of the features is 
obtained by counting the numbers of local features assigned to code words in 
the dictionary. 


11.2.4.2 Deep learning approaches 

CNNs are particularly suitable for texture analytical applications, given that
they are generally designed for image analysis and that their trainable filter
banks are finely attuned to the detection of repetitive textural patterns. Since
AlexNet’s success in the ImageNet competition, pretrained CNNs have been 
applied highly successfully in textural image analysis. In particular, image 
texture is represented by the features learnt by these networks when trained 
to identify labels associated with images, whether discrete or numeric. In the 
majority of cases, this is accomplished by transfer learning, as discussed in 
Section 11.1. 

More recently, CNNs have been customized for texture analysis, e.g., T-CNN
[59, 60], B-CNN [61], FASON [62], Deep-TEN [63], and deep adaptive
wavelet network (DAWN) [54]. Although some studies have indicated the 
advantages of these architectures over more general ones [54], the perfor- 
mance of these customized networks compared to classical CNNs has not 
been widely investigated as yet. 


11.3 Methodology of Texture Analysis with Convolutional 
Neural Networks 


The texture analytical methodology considered in the rest of the chapter 
is described in more detail in this section. This includes brief summaries 
of GLCMs, LBPs, and textons that served as a baseline for comparative 
assessment of the performance of the deep learning methods that were 
investigated. 


11.3.1 Overall Analytical Methodology 


The general analytical methodology for analysis of texture based on the use 
of machine learning and transfer learning is illustrated in Figure 11.4. 


Figure 11.4 Analytical approach to texture analysis with machine and transfer learning. (Recovered diagram labels: source labelled images, engineered feature extraction, convolution, random forest, predicted image labels.)


a) Image acquisition: Images representing some system are first acquired. 
These images could be videographic, optical, scanning electron micro- 
graphs, or images generated from other data sources, such as signals or 
hyperspectra. 

b) Feature extraction: Textural features are subsequently extracted from 
these images. This can be accomplished by use of traditional algorithms, 
such as GLCMs, LBPs, etc., as summarized in Table 11.1. It can also be 
accomplished by using CNNs and other deep learning algorithms. 

c) Modeling: The extracted features serve as predictors or input variables 
to a model, the design of which depends on the image labels to be 
predicted. The labels can be categorical, for example, to designate a 
common object, such as in the ImageNet data base. The labels could 
also be continuous, as would be the case with image-based soft sensors. 


11.3.2 Traditional Algorithms 


Three traditional algorithms were considered in the case studies in this 
chapter, namely algorithms based on gray level co-occurrence matrices 
(GLCM), local binary patterns (LBP), and textons. These algorithms are 
briefly summarized below. 


11.3.2.1 Gray level co-occurrence matrices (GLCM) 
Consider the GLCM, $A_I(D, G)$, of an image $I$ parameterized by $D$ and $G$.
More specifically, $D$ is some measure of the distance between pairs of pixels
in the image and $G$ is the number of gray levels in the image. Each entry $a_{ij}$
in the $G \times G$ matrix is the number of times that the gray levels $i$ and $j$
are associated with a pair of pixels displaced in the image by a distance $D$,
as indicated in Figure 11.5.

Figure 11.5 An image (left) with co-occurrence matrices (middle and right) illustrating
the frequency of pair-wise combinations of three gray levels, represented by Cartesian
coordinates (u,v).

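The Haralick descriptor set [25] derived from GLCMs is
widely used in image analysis. Four of these features were used in this
investigation.

$$\mathrm{ENE} = \sum_{i,j} \hat{a}_{ij}^{2} \tag{11.1}$$

$$\mathrm{CON} = \sum_{i,j} (i - j)^{2}\, \hat{a}_{ij} \tag{11.2}$$

$$\mathrm{COR} = \sum_{i,j} \frac{(i - m_i)(j - m_j)\, \hat{a}_{ij}}{s_i s_j} \tag{11.3}$$

$$\mathrm{HOM} = \sum_{i,j} \frac{\hat{a}_{ij}}{1 + |i - j|} \tag{11.4}$$

In Equations (11.1)-(11.4), $\hat{a}_{ij}$ is the $(i,j)$th element of the normalized
GLCM and $m_i$, $m_j$, $s_i$, and $s_j$ are the means and standard deviations of the
matrix rows ($i$) and columns ($j$).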
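
A minimal sketch of the four descriptors, computed from a normalized GLCM, is given below. It uses the common textbook forms of energy, contrast, correlation, and homogeneity; the homogeneity term with $1 + |i - j|$ in the denominator is an assumption here, since some definitions use $1 + (i - j)^2$ instead. The function name and the example matrix are hypothetical.

```python
import math


def haralick_features(p):
    """Energy, contrast, correlation, and homogeneity of a normalized GLCM p."""
    idx = range(len(p))
    cells = [(i, j) for i in idx for j in idx]
    # Means and standard deviations over the rows (i) and columns (j).
    m_i = sum(i * p[i][j] for i, j in cells)
    m_j = sum(j * p[i][j] for i, j in cells)
    s_i = math.sqrt(sum((i - m_i) ** 2 * p[i][j] for i, j in cells))
    s_j = math.sqrt(sum((j - m_j) ** 2 * p[i][j] for i, j in cells))

    energy = sum(p[i][j] ** 2 for i, j in cells)
    contrast = sum((i - j) ** 2 * p[i][j] for i, j in cells)
    homogeneity = sum(p[i][j] / (1 + abs(i - j)) for i, j in cells)
    if s_i > 0 and s_j > 0:
        covariance = sum((i - m_i) * (j - m_j) * p[i][j] for i, j in cells)
        correlation = covariance / (s_i * s_j)
    else:
        correlation = float("nan")  # undefined for a constant image
    return {"ENE": energy, "CON": contrast, "COR": correlation, "HOM": homogeneity}


# Normalized GLCM of an image in which neighboring pixels always share a gray level.
p = [[0.5, 0.0],
     [0.0, 0.5]]
print(haralick_features(p))
```

For this diagonal matrix the contrast is zero and the homogeneity and correlation are both one, as expected when neighboring pixels never differ.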

Methods based on GLCM have been considered extensively in a range of
applications in various technical domains.


11.3.2.2 Local binary patterns (LBPs)

LBPs are derived from images based on the differences between intensities of 
neighboring pixels in images [27]. The difference in the intensity of a given 
pixel (gc) and that of one of its neighboring pixels (gp) is set to either 0 or 1, 
after thresholding with a function s, as in the following equation: 


$$s(g_p - g_c) = \begin{cases} 0, & \text{if } g_p < g_c \\ 1, & \text{if } g_p \ge g_c \end{cases} \tag{11.5}$$

The next step after thresholding is to compute a local binary pattern
(LBP) in accordance with eqn (11.6), for all $p = 1, 2, \ldots, P$.

$$\mathrm{LBP} = \sum_{p=1}^{P} 2^{p-1}\, s(g_p - g_c) \tag{11.6}$$

The LBP operator in eqn (11.6) is applied to each pixel in each image
with G gray levels, as indicated in Figure 11.6, with the result that images
can be represented by LBPs that range from 0 to $2^P - 1$ (0 to 255 for $P = 8$),
while the images themselves can be referred to as LBP images.
Feature extraction from images based on LBP is used widely in a range of 
disciplines, including mineral processing [32, 66, 67], some aspects of which 
are discussed in the case study in Section 11.2. 
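
The thresholding and weighting steps can be sketched for the basic eight-neighbor case as follows. The neighbor ordering (clockwise from the top-left pixel) and the function names are assumptions made for this example; a different starting neighbor gives different code values but an equivalent description.

```python
# Offsets (dy, dx) of the eight neighbors, clockwise from the top-left pixel.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, y, x):
    """LBP value of the pixel at (y, x), which must not lie on the image border."""
    center = image[y][x]
    code = 0
    for p, (dy, dx) in enumerate(NEIGHBOR_OFFSETS):
        # Threshold: 1 if the neighbor is at least as bright as the center, else 0.
        if image[y + dy][x + dx] >= center:
            code += 2 ** p  # weights 1, 2, 4, ..., 128
    return code


def lbp_image(image):
    """Apply the operator to every interior pixel of the image."""
    rows, cols = len(image), len(image[0])
    return [[lbp_code(image, y, x) for x in range(1, cols - 1)]
            for y in range(1, rows - 1)]


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
print(lbp_code(patch, 1, 1))  # 1 + 16 + 32 + 64 + 128 = 241
```

Textural features are then typically taken as the histogram of the codes in the LBP image.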


11.3.2.3 Textons 

Textons are cluster centers located in a space defined by the responses of 
localized filters. The filter response space is generated by convolving a set of 
training images with spatially configured basis functions that are contained in 
a bank of filters [56], as shown in Figure 11.7.

Figure 11.6 Center image pixel (shaded), surrounded by eight neighboring pixels (left),
binary thresholded pixel values (middle). Thresholded values are multiplied with the
conversion weights shown to generate the decimal LBP value replacing the center pixel (right).

Figure 11.7 Texton image features obtained from image filtering, extraction, and clustering
of localized filter vectors. These vectors comprise a texton dictionary to which images can be
mapped and features extracted based on the resulting histogram counts.


The cluster centers in the filter response space are referred to as a texton 
dictionary and are typically determined by use of k-means cluster analysis. 
Image pixels are mapped to this feature space and texton features extracted 
from an image consisting of counts of the numbers of pixels assigned to 
specific texton channels [66]. 
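
The mapping step can be sketched as a nearest-center assignment followed by counting. The sketch assumes that a texton dictionary has already been learned (for example by k-means, which is not shown) and that each pixel is described by a filter response vector; the two-dimensional responses, the three textons, and the function names below are made-up values for illustration only.

```python
def nearest_texton(response, textons):
    """Index of the texton (cluster center) closest to one filter response vector."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(range(len(textons)), key=lambda k: squared_distance(response, textons[k]))


def texton_histogram(responses, textons):
    """Number of pixels assigned to each texton channel."""
    counts = [0] * len(textons)
    for response in responses:
        counts[nearest_texton(response, textons)] += 1
    return counts


# Hypothetical dictionary of three textons in a two-filter response space.
textons = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
# Hypothetical filter responses of five pixels.
responses = [(0.1, 0.0), (0.9, 1.2), (3.5, 0.2), (0.2, -0.1), (4.2, 0.1)]

print(texton_histogram(responses, textons))  # [2, 1, 2]
```

The resulting count vector has one entry per texton and serves as the feature vector of the image.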

In the case studies considered in this chapter, the Schmid filter bank
[68] with 13 rotationally invariant filters was used, as given by eqn (11.7).
In this equation, $r$ are the image pixel indices, $s$ is a scale factor, and $t$ is
the frequency of the harmonic function in the Gaussian component of the
filter [53].

$$F(r, s, t) = F_0(s, t) + \cos\left(\frac{\pi t r}{s}\right) e^{-r^2/(2s^2)} \tag{11.7}$$
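
A single filter of this kind can be sketched as follows, using the standard isotropic form in which a cosine of the radial distance $r$ is modulated by a Gaussian envelope. In this sketch the constant $F_0(s, t)$ is realized by subtracting the kernel mean so that the filter has a zero DC component, and the kernel is then scaled to unit L1 norm; the normalization, the kernel half-width, and the function name are assumptions for illustration.

```python
import math


def schmid_kernel(s, t, half_width=24):
    """Isotropic kernel cos(pi * t * r / s) * exp(-r^2 / (2 s^2)) with zero mean."""
    size = 2 * half_width + 1
    kernel = []
    for y in range(-half_width, half_width + 1):
        row = []
        for x in range(-half_width, half_width + 1):
            r = math.hypot(x, y)  # radial distance from the kernel center
            row.append(math.cos(math.pi * t * r / s) * math.exp(-r * r / (2 * s * s)))
        kernel.append(row)
    # Subtract the mean (the role played by F0) so that the kernel sums to zero.
    mean = sum(sum(row) for row in kernel) / (size * size)
    kernel = [[value - mean for value in row] for row in kernel]
    # Scale to unit L1 norm so that responses of different filters are comparable.
    norm = sum(abs(value) for row in kernel for value in row)
    return [[value / norm for value in row] for row in kernel]


kernel = schmid_kernel(s=2, t=1, half_width=6)
print(len(kernel), len(kernel[0]))
print(abs(sum(sum(row) for row in kernel)) < 1e-9)
```

Because the kernel depends only on $r$, it is rotationally invariant; a filter bank is obtained by evaluating the function for several $(s, t)$ pairs.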


Studies in mineral processing have indicated that textons are bet- 
ter able to capture textures associated with mineral processing systems, 
such as flotation froths, bulk particulates, or slurry flows, compared with 
other widely used feature extraction algorithms, including GLCM and LBP 
methods [66, 69].

Table 11.2 Convolutional neural networks used in this chapter

Neural network | Depth (weights) | Reference
AlexNet | 8 (61 million) | Krizhevsky et al. [70]
VGG19 | 19 (144 million) | Simonyan and Zisserman [71]
GoogLeNet | 22 (7 million) | Szegedy et al. [72]
ResNet50 | 50 (25.6 million) | Tian and Chen [73]


11.3.3 Deep Learning Algorithms 


Four deep learning algorithms were considered in this chapter. These were 
AlexNet, VGG19, GoogLeNet, and ResNet50. These network architectures 
are described in detail elsewhere, as indicated in Table 11.2. 


11.4 Case Study 1: Voronoi Simulated Material Textures 
11.4.1 Voronoi Simulation of Material Textures 


In this first case study, the textural structures of natural materials were 
simulated with graphic representations of Voronoi tessellations, owing to the 
similarity between the textures of natural materials and Voronoi diagrams. 

Image textures grouped into two equisized Classes A and B of 1000 
images each were both generated by the same bivariate uniform distribution. 
The only difference was that Class A and B images were obtained from 
tessellation of realizations of the distribution consisting of 100 and 120 data 
points, respectively. As a result, the average size of the grains in Class A was 
larger than that in Class B, as indicated in Figure 11.8. 
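
The construction of the two classes can be sketched as below. This is only an illustration under stated assumptions: the image size of 64 × 64 pixels, the random seeds, and the function names are hypothetical, and each pixel is simply labeled with the index of its nearest seed point rather than rendered as a graphic texture.

```python
import random


def voronoi_labels(n_points, size=64, seed=0):
    """Label each pixel with the index of its nearest uniformly drawn seed point."""
    rng = random.Random(seed)
    seeds = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_points)]
    labels = []
    for y in range(size):
        row = []
        for x in range(size):
            row.append(min(range(n_points),
                           key=lambda k: (x - seeds[k][0]) ** 2 + (y - seeds[k][1]) ** 2))
        labels.append(row)
    return labels


def mean_cell_area(labels):
    """Average number of pixels per Voronoi cell present in the image."""
    cells = {v for row in labels for v in row}
    return len(labels) * len(labels[0]) / len(cells)


class_a = voronoi_labels(100, seed=1)  # 100 data points, coarser grains
class_b = voronoi_labels(120, seed=2)  # 120 data points, finer grains
```

With the same image area shared among more cells, the Class B image has the smaller mean cell area, which is the only systematic difference between the two classes.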


11.4.2 Comparative Analysis of Convolutional Neural Networks 
and Traditional Algorithms 


A PyTorch backend was used to build the CNNs used in the case studies in 
this chapter. Experiments were run on the Google Colab platform, and Voronoi 
images were classified using features based on GLCMs, LBPs, and textons, as 
well as features from AlexNet, VGG19, GoogLeNet, and ResNet50 that had 
been pretrained on the ImageNet database. This latter approach using the 
CNNs can be referred to as deep feature extraction. 

Three variants of these deep features were extracted, as previously indicated 
in Figure 11.2. That is, features were extracted by training the fully 
connected layers of the networks only or, in addition to this, by also training 
some or all of the feature layers of the networks. 

Figure 11.8 Exemplars of simulated Voronoi textures in Classes A and B (top and bottom 
rows, respectively) in Case Study 1. 


To make the comparison as exact as possible, feature sets were sub- 
sequently used as predictors in the random forest (RF) models trained to 
discriminate between images from Class A and B, based on the average out- 
of-bag (OOB) errors recorded over 30 runs. Table 11.3 gives a summary of 
the hyperparameters used to develop the random forest models. 

The exception to this was where features were extracted from CNNs 
with partially or fully retrained feature layers. In this case, the ability of the 
network to correctly classify the images from Classes A and B was taken 
as an indicator of the quality of the features. For this purpose, the image 
data sets were split into training (70%) and test (30%) data sets. The test 
sets were used to assess the performance of the CNNs that were optimized 
with the Adam (adaptive moment estimation) algorithm [74] during 
training. 

Table 11.3 Hyperparameters of random forest models used in Case Studies 1 and 2 

Hyperparameter    Description                                                    Value 
m_try             Number of candidate variables evaluated at each node split     √m 
m_tree            Proportion of total number of samples evaluated by each tree   70% 
K                 Number of trees in the forest                                  500 
Node size         Smallest number of samples supporting a terminal node          1 
Replacement       Drawing samples with or without replacement                    With 
Splitting rule    Criterion used for splitting nodes                             Gini 

It was assumed that the pretrained neural networks already had near- 
optimal weights and, therefore, a small learning rate of 0.0001 was used so 
as not to unnecessarily disrupt the weight settings of the networks. As an 
additional measure to prevent overfitting, image augmentation was used dur- 
ing training. This included horizontally flipping, randomly rotating, shearing, 
and shifting the original images. 

Table 11.4 gives a summary of the performance of the different models. 
The RF models using the traditional feature sets (GLCM, LBP, and textons) 
as predictors performed satisfactorily, reaching an accuracy of 87%-89%. 
Use of the features generated by the partially retrained networks improved 
the accuracy to at least 93%, while features obtained with the fully retrained 
networks further improved the accuracy to at least 95%. 

The top performer was ResNet50, shown in the first row of Table 11.4. 
Interestingly, this network consistently outperformed the other deep learning 
architectures within each of the feature variant groups, i.e., when compared 
with its counterparts with fixed, partially retrained, or fully retrained feature 
layers. 

Further insight into the performance of the algorithms can be gained by 
mapping the features to bivariate score plots with a t-distributed stochastic 
neighbor embedding (t-SNE) algorithm. The algorithm embeds features in 
a low-dimensional space so as to optimally preserve the similarities of the 
points in the original high-dimensional space. This is achieved by minimizing 
the Kullback-Leibler divergence of the distributions of the points in the high- 
and low-dimensional feature spaces. 
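
The cost function that t-SNE minimizes can be sketched as follows. The low-dimensional similarities are computed with a Student t kernel with one degree of freedom, as in the standard algorithm. For brevity, the high-dimensional similarities P are simply taken as given; in the example they are produced by the same kernel as a stand-in, which is an assumption made purely for illustration, as are the function names.

```python
import math


def student_t_affinities(points):
    """Pairwise similarities q_ij from a Student t kernel (one degree of freedom)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = 1.0 / (1.0 + d2)
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all pairs with nonzero p_ij."""
    return sum(pij * math.log(pij / max(qij, eps))
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0.0)


# Two candidate embeddings of four points forming two tight pairs.
embedding_a = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.1, 3.0)]
embedding_b = [(0.0, 0.0), (3.0, 3.0), (0.1, 0.0), (3.1, 3.0)]
p = student_t_affinities(embedding_a)  # stand-in for the high-dimensional P
cost_same = kl_divergence(p, student_t_affinities(embedding_a))
cost_swapped = kl_divergence(p, student_t_affinities(embedding_b))
```

An embedding that reproduces the neighborhood structure gives a cost near zero, whereas one that separates similar points is penalized with a positive cost.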

The t-SNE scores of the image features are shown in Figure 11.9 together 
with their associated classification accuracies. Although all the traditional 
methods and the direct deep feature extraction methods produce reasonably 
good separation between the two classes, it is striking that two clearly 
separate clusters form when deep features extracted from the partially 
retrained and fully retrained networks are used. 


Table 11.4 Assessment of texture features as predictors in Case Study 1 

Model           Number of features    Accuracy (%) 
ResNet50**      2048                  99.0 
VGG19**         4096                  98.5 
GoogLeNet**     1024                  98.3 
ResNet50*       2048                  98.3 
AlexNet**       4096                  95.8 
GoogLeNet*      1024                  93.8 
VGG19*          4096                  93.8 
AlexNet*        4096                  93.3 
ResNet50        2048                  91.4 
LBP             59                    89.2 
AlexNet         4096                  88.3 
Textons         20                    87.1 
GLCM            4                     87.0 
VGG19           4096                  81.6 
GoogLeNet       1024                  78.3 

Note: In models indicated by one or more “*,” accuracy represents the performance of the end-to-end 
classifier. “*” and “**” indicate networks in which some or all the feature layers were retrained, 
respectively. All other models were random forests, where the accuracies indicated are out-of-bag values 
derived from the training data set. 

The score plots shown in Figure 11.9 confirm the advantage of retraining 
the feature layers of the CNNs. Retraining of some of the final feature 
layers (“*”) led to an improvement in the features over CNNs where no such 
retraining was done. In turn, retraining of all the feature layers (“**”) led to 
a further marked improvement in the performance of the classifiers, as clearly 
indicated by the separation of the clusters in the score plots. 


11.5 Case Study 2: Textures in Flotation Froths 
11.5.1 Froth Image Analysis in Flotation 


Froth flotation is an important operation in mineral processing, where 
valuable material is separated from waste or gangue in a slurry in an agitated 
tank or flotation cell. Chemical reagents added to the slurry support the 
generation of a froth layer on top of the slurry, as well as enhancing the 
hydrophobicity of the valuable material. This facilitates the concentration of 
the valuable species in the froth, from where it can be easily recovered. 

Figure 11.9 t-SNE score plots of the traditional predictors (top row, GLCM, LBP, and 
textons), AlexNet (second row), VGG19 (third row), GoogLeNet (fourth row), and ResNet50 
(bottom row) in Case Study 1. Superscripts “*” and “**” indicate CNNs where feature layers 
were partially or fully retrained. Black dots and red “+” markers indicate Classes A and B, 
respectively. 

Flotation is a complex process, and the appearance of the froth is a useful 
indicator of the performance of the flotation cell. However, operators may 
find it very challenging to discriminate between different froth structures, 
and computer vision has long been investigated as a more reliable approach 
to decision support on flotation plants. In this case study, the application 
of deep learning to froth images obtained from a platinum group metal 
flotation plant in South Africa is considered. The collection of the images 
and previous analyses are discussed in more detail by Marais and Aldrich [75] 
and Horn et al. [76]. 

A total of 6856 images of the froth in a primary cleaner flotation cell, each 
of size 256 × 256 pixels, were collected over a 4-hour period. During this 
time, the air flow to the cell was periodically varied. In addition, the platinum 
content of samples of the froth was analyzed in a laboratory, and these results 
could be associated with the froth images that were collected. 

Froths associated with normal operating conditions bearing high concen- 
trations of platinum were labeled as Class A. Three other operating regimes 
were also identified, as indicated by froth structures labeled as Class B, Class 
C, and Class D. Progressively lower platinum values were associated with 
these three classes. The four operating regimes are represented by 1960, 1260, 
1722, and 1940 images, respectively. Figure 11.10 shows an example of an 
image of each of these ordinal classes, as well as the relative concentration of 
platinum in parentheses. 

Figure 11.10 also shows that the coarser froth typical of Class A was 
comparatively distinct from the structures of the other three classes that 
exhibited progressively finer bubble size distributions. These structures were 
linked to the different platinum grades in the images from left to right in 
Figure 11.10. As a consequence, identification of a particular froth class 
would also give an estimate of the grade associated with the image. 
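
As a simple illustration of this link, the sketch below maps a predicted froth class to the relative platinum concentration reported for that class in Figure 11.10. The function name and the example class probabilities are hypothetical, and the lookup is only a coarse, class-level estimate rather than a calibrated grade model.

```python
# Relative platinum concentrations per froth class, as given in Figure 11.10.
RELATIVE_PLATINUM = {"A": 1.0, "B": 0.464, "C": 0.306, "D": 0.115}


def estimate_relative_grade(class_probabilities):
    """Return the most probable froth class and its associated relative grade."""
    label = max(class_probabilities, key=class_probabilities.get)
    return label, RELATIVE_PLATINUM[label]


# Hypothetical classifier output for one froth image.
probabilities = {"A": 0.10, "B": 0.70, "C": 0.15, "D": 0.05}
predicted_class, relative_grade = estimate_relative_grade(probabilities)
```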


11.5.2 Recognition of Operational States with Convolutional 
Neural Networks 


The same framework that was used in Case Study 1 was used for classification 
of Classes A to D in this case study. The hyperparameters summarized in 
Table 11.3 were again used for the random forest models. However, in this 
case study, only GoogLeNet is compared with the three traditional models. 
This is again done based on different levels of training of the feature layers 
of the network, as was done in Case Study 1. 

Class A (1)    Class B (0.464)    Class C (0.306)    Class D (0.115) 

Figure 11.10 Typical images associated with Class A, Class B, Class C, and Class D 
operational regimes. Relative platinum concentrations are shown in parentheses. 

Table 11.5 gives a summary of the performance of the different models, 
as well as the corresponding number of features used as predictors in each 
model. The classification accuracy of GoogLeNet was 94.4% when its feature 
layers were retrained in part and 99.5% when they were retrained in full. 

These accuracies were markedly higher than those (70%-82%) obtained with 
any of the other feature sets, including the traditional feature sets and the 
GoogLeNet features extracted without retraining. This observation confirms 
the effectiveness of retraining CNNs. 

Principal component analysis (PCA) was used to visualize the features 
extracted from the froth images, by projecting the features to a 
two-dimensional space, as shown in Figure 11.11. More specifically, the 
principal component model was constructed from the features of froth Class 
A only, followed by projection of the features from the images of the other 
classes onto the model. 


Table 11.5 Assessment of texture features as predictors in Case Study 2 

Model           No. of features    Accuracy (%) 
GoogLeNet**     1024               99.5 
GoogLeNet*      1024               94.4 
Textons         20                 81.8 
GoogLeNet       1024               79.9 
GLCM            4                  71.8 
LBP             59                 69.8 

Note: In models indicated by one or more “*,” accuracy represents the performance of the end-to-end 
classifier. “*” and “**” indicate networks in which some or all the feature layers were retrained, 
respectively. All other models were random forests, where the accuracies indicated are out-of-bag values 
derived from the training data set. 

Figure 11.11 Principal component score plots of the GLCM (top, left), LBP (top, right), 
textons (middle, left), GoogLeNet with transfer learning (middle, right), GoogLeNet with 
partial retraining (bottom, left), and GoogLeNet with full retraining (bottom, right) in Case 
Study 2. Classes A-D are denoted by black circles, red stars, blue squares, and green “**” 
markers, respectively. 


11.6 Case Study 3: Imaged Signal Textures 


11.6.1 Treating Signals as Images 


Transfer learning can also play an important role in signal processing with 
deep neural networks. The ImageNet architectures can be used directly by 
converting the signals to images, as illustrated in Figure 11.12. As indicated 
in Figure 11.12, a sliding window of a user specified length (b) is moved 
across a multivariate time series with step size s. 

The data in each segment are subsequently used as a basis for generating 
an image that captures the information in the window. Distance plots of size 
b × b are easy to construct, as their (i, j)th elements are the distances between 
point i and point j captured by the window. These images will display a 
texture commensurate with the behavior of the time series. 
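
The windowing and distance plot construction can be sketched as follows, using the window length b and step size s from the text. The function names and the synthetic bivariate series are hypothetical and serve only as an illustration.

```python
import math


def sliding_windows(series, b, s):
    """Segments of length b taken every s steps from a list of observation vectors."""
    return [series[start:start + b] for start in range(0, len(series) - b + 1, s)]


def distance_plot(window):
    """b x b matrix whose (i, j)th element is the Euclidean distance between points i and j."""
    return [[math.dist(p, q) for q in window] for p in window]


# Synthetic bivariate time series with 50 observations.
series = [(math.sin(0.3 * k), math.cos(0.2 * k)) for k in range(50)]
windows = sliding_windows(series, b=20, s=10)
plots = [distance_plot(w) for w in windows]
```

Each distance plot is symmetric with a zero diagonal, and it can be rendered as an image by mapping the distances to gray levels or colors.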

Following this, features can be extracted from the image by any of a 
number of algorithms designed for this purpose. These features, together with 
labels derived from the time series, are collected in a data matrix that can 
then serve as the basis for analysis of the time series or development of time 
series models. 

Examples of image representations of time series or signals are shown in 
Figure 11.13 for Euclidean distance plots, Gramian angular fields, Markov 
transition fields, and continuous wavelet transforms, all of which have 
recently been used in time series analysis. 

Figure 11.12 Multivariate image analysis of signals. 


11.6.2 Monitoring of Stock Prices by the Use of Convolutional 
Neural Networks 


In this subsection, a historic daily gold price data set is considered over the 
period from 1979 to early 2021. The data set includes over 11,000 values of 
the daily closing price of gold in USD. 

Figure 11.13 Example of image representations of time series measurements on a South 
African base metal flotation plant. 

Figure 11.14 Percentage change in the daily closing price of gold from 1979 to 2021. 

Table 11.6 Basic statistics for the % change of gold price data set for Case Study 3 

Count    Mean     St. Dev.    Min       Max      25%       50%      75% 
11170    0.026    1.183       -13.24    13.32    -0.465    0.000    0.517 

Table 11.7 Basic statistics for the absolute % change of gold price data for Case Study 3 

Count    Mean     St. Dev.    Min      Max       25%      50%      75% 
11170    0.755    0.910       0.000    13.315    0.190    0.489    0.995 

Figure 11.15 100-day (left), 200-day (middle), and 300-day (right) Euclidean distance plots 
of the daily percentage change in the historic gold price. 

Figure 11.16 t-SNE plots of the 100-day (top, left), 200-day (top right), and 300-day 
(bottom) Euclidean distance plots of the daily percentage change in the historic gold price. 


The percentage change of the gold price with time is shown in 
Figure 11.14. Basic statistics of the change and of its absolute value are 
given in Tables 11.6 and 11.7. The gold price changed considerably on some 
days (by more than 10%), while on most days (more than 75% of the days) 
the absolute change was within 1%. The mean absolute change over the 
whole period was 0.755%. Positive and negative changes occurred with 
almost the same frequency. 
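
Summary statistics of this kind can be computed as in the following sketch. The short price list is hypothetical and not taken from the gold price data set, and the quartile convention (the "inclusive" method of the Python standard library) is an assumption, since the chapter does not state which convention was used.

```python
import statistics


def percentage_change(prices):
    """Day-on-day percentage change of a list of closing prices."""
    return [100.0 * (b - a) / a for a, b in zip(prices, prices[1:])]


def summarize(values):
    """Count, mean, standard deviation, extremes, and quartiles of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"count": len(values), "mean": statistics.fmean(values),
            "std": statistics.stdev(values), "min": min(values),
            "max": max(values), "25%": q1, "50%": q2, "75%": q3}


prices = [100.0, 101.0, 99.99, 99.99, 102.0]  # hypothetical closing prices
changes = percentage_change(prices)
signed_summary = summarize(changes)
absolute_summary = summarize([abs(c) for c in changes])
```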

In this subsection, the effects of window sizes of 100, 200, and 300, 
corresponding to intervals of 100, 200, and 300 days, are studied. The time 
series of gold price percentage changes is segmented into 111 × 100, 55 × 
200, and 37 × 300 matrices, respectively, after which each time series 
segment is converted to a normalized distance matrix of dimension 100 
× 100, 200 × 200, or 300 × 300. Each distance matrix is then converted to 
a color image by color mapping. The last image from each 
image set is shown in Figure 11.15. 
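
The segment counts follow from integer division of the 11,170 observations by the window size, as the sketch below shows. It assumes that the segments do not overlap, that any remainder at the end of the series is dropped, and that the distance matrix is normalized by its largest element; none of these details is stated in the text, and the series used here is a synthetic stand-in.

```python
def segment(series, window):
    """Split a series into consecutive, non-overlapping segments (remainder dropped)."""
    n = len(series) // window
    return [series[k * window:(k + 1) * window] for k in range(n)]


def normalized_distance_matrix(values):
    """Distance matrix of a univariate segment, scaled to the range [0, 1]."""
    d = [[abs(a - b) for b in values] for a in values]
    d_max = max(map(max, d)) or 1.0  # guard against a constant segment
    return [[v / d_max for v in row] for row in d]


# Synthetic stand-in for the 11,170 daily percentage changes.
series = [((7 * k) % 13 - 6) / 5.0 for k in range(11170)]
counts = {w: len(segment(series, w)) for w in (100, 200, 300)}
last_matrix = normalized_distance_matrix(segment(series, 100)[-1])
```

For a univariate series the Euclidean distance between two points reduces to the absolute difference, which is what the second function computes before scaling.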

The following numbers of textural images were extracted from the time 
series with the different window sizes: 111 images (100 × 100 × 3), 55 
images (200 × 200 × 3), and 37 images (300 × 300 × 3). These were passed 
through GoogLeNet, without retraining on the target data, to obtain the 
respective direct deep features. These deep features are visualized using the 
t-SNE algorithm, as shown in Figure 11.16. The color bars in these figures 
indicate time, with yellow indicating a more recent period than blue. 

As can be seen from Figure 11.16, the clearest patterns are associated 
with the 300-day moving window, where more recent patterns (yellow and 
orange) appear to be segregated to some degree from earlier patterns (blue 
hues). 


11.7 Discussion 


With their deep architectures, CNNs continue to outperform classical meth- 
ods in image recognition in an increasing number of domains [77]. Since 
ab initio design and training of large CNNs is costly and the availability or 
acquisition of sufficient data for this purpose may be unfeasible, the use of 
pretrained networks may be the only viable approach. 

All the case studies demonstrate the advantage of features extracted with 
CNNs over those extracted by traditional methods, specifically the GLCM, 
LBP, and texton algorithms. This is particularly the case when the feature 
layers of the CNNs are retrained through end-to-end classification. 

Even when it is not possible to retrain the feature layers of the CNNs, for 
example, when sufficient training data are not available, these studies suggest 
that CNNs pretrained on ImageNet data are able to achieve results equivalent 
to those achievable with traditional methods. 

The reliability of the CNNs may have been affected by the differences 
between the ImageNet images used as source data and the textural images 
used as target data. The relatively small sizes of the target data sets used 
in Case Studies 1 and 2 would likely also have had an inhibitory effect on 
the models. Retraining of the feature layers of the networks would generally 
compensate for the dissimilarity of the source and target data sets. 

For the retraining of the pretrained networks, backpropagation is used to 
fine-tune the weights of the pretrained networks. Earlier layers of the pre- 
trained networks tend to capture generic features that are useful for different 
tasks across different domains, while layers further down in the processing 
paths of the networks tend to capture more complex target domain-specific 
features. It, therefore, makes sense to retrain the weights of later layers only, 
to force the CNNs to learn high-level features that are specific to the target 
data set. Furthermore, full retraining of all the layers led to near-perfect 
accuracy when using ResNet50 and GoogLeNet in Case Studies 1 and 2, 
respectively. 

Finally, it should be noted that one of the drawbacks of the deep learning 
approaches compared to the traditional ones is the cost of model development. 
Large-scale development of these models is best done in high performance 
computing environments. While these are becoming more accessible via 
cloud computing, for example, it is likely to remain a challenge, as large- 
scale today may pale in comparison to what would be large-scale in future. 
To some extent, this can be addressed by a future focus on texture analysis 
with more compact CNNs (e.g., MobileNet or SqueezeNet). 


11.8 Conclusion 


In this chapter, texture analysis with CNNs is explored, from which the 
following conclusions can be made. 


a) Transfer learning with CNNs, such as ResNet50, GoogLeNet, VGG19, 
and AlexNet pretrained on the ImageNet database of common objects, 
can be used without any further training to generate highly discriminative 
features of image textures. 

b) Further improvement was achieved by partial or full retraining of these 
networks. This led to significantly higher or even near-perfect classification 
of the different Voronoi textures in Case Study 1 and recognition 
of the froth states on a platinum group metal flotation plant in Case 
Study 2. 

c) Of the four CNNs that were compared in Case Studies 1 and 2, i.e., 
AlexNet, VGG19, GoogLeNet, and ResNet50, the latter yielded the 
most reliable features. All the networks performed as well as or better 
than the traditional methods, i.e., LBP, GLCM, and textons. These 
results are in line with those of other emerging investigations. 

d) Finally, the impact of transfer learning extends beyond textural feature 
analysis of optical images and is also having a major impact on 
(non)linear signal processing. 
Abstract 


Internet of Things (IoT) systems have revolutionized medical systems 
through the use of devices and sensors that collect data for various uses. 
These devices generate data in a variety of formats, including text, photos, 
and videos. Hence, obtaining accurate and usable data from the massive 
amounts of data generated by IoT-based systems is a difficult task. The 
diagnosis of various diseases using data generated by IoT has recently 
emerged as a promising topic that requires sophisticated and effective 
methodologies. Because of the wide range of disease symptoms and 
indications, reliable diagnosis is difficult. Existing solutions rely either on 
handcrafted features or on conventional machine learning models. Therefore, 
this chapter presents the applicability of IoT-enabled convolutional neural 
networks (CNNs) in healthcare diagnosis. The challenges and future prospects 
of IoT-enabled CNNs are discussed. The chapter proposes an intelligent 
IoT-enabled CNN for the diagnosis of patients’ health status, in which the 
CNN is used to diagnose disease from data captured by IoT-based sensors. 
As a result, the system takes advantage of the properties of both the dataset 
and the CNN, ensuring excellent reliability and accuracy. In a case study on 
healthcare dataset classification, the proposed system demonstrates real-time 
health monitoring, and its performance is tested in terms of various metrics. 
The proposed system performs better than existing methods, with an accuracy 
of 98.4%, specificity of 98.7%, sensitivity of 98.9%, and F-score of 98.3%. 


Keywords: Internet of Things, convolutional neural network, healthcare 
systems, diagnosis, deep learning, machine learning. 


12.1 Introduction 


The recent emergence of new information technologies has brought about the 
Internet of Things (IoT), which has changed the way we think and behave 
globally [1, 2]. This technology has been used in various domains such as 
transportation [3, 4], agriculture [5, 6], smart cities [7, 8], industry [9, 10], 
and especially healthcare systems, for the monitoring and diagnosis of various 
diseases [11, 12]. The transformation brought by IoT in healthcare systems 
created a new innovation called the Internet of Medical Things (IoMT), which 
is used for indoor quality monitoring, monitoring of elderly persons, disease 
diagnosis, and treatment, among others [2, 13, 14]. This paradigm shift in 
healthcare has provided various opportunities through the use of wearable 
devices and sensors that capture physiological signs for enhanced healthcare 
systems, in close relation with mHealth and eHealth medical systems [13]. 
IoT has brought various advantages to healthcare systems, such as 
availability of and low-cost access to medical diagnosis and treatment, which 
explains the increase in the use of this innovation in recent years. The 
use of biophysical data for various diseases and for the monitoring of patients 
to support healthcare decisions has helped to ease the burden on health 
workers and greatly reduce medical costs. 

New technological innovations such as IoT-based systems with artificial 
intelligence (AI) have been used to process and handle the big data generated 
by sensors and so to advance medical services [15, 16]. 
IoT-based systems can capture data that can be processed to give useful 
clinical information, which specialists can use to predict patients’ health 
conditions [17]. Data analysis and information extraction are complex 
processes that should guarantee improved security strategies [18]. IoT 
devices send measurement data to a central storage location for centralized 
decision-making [19]. In the clinical domain, such measurement data are 
typically physiological signals such as pulse, heartbeat signals, and 
temperature measurements [20]. 

The use of AI on the large volumes of data generated offers opportunities for 
healthcare systems to accurately predict diseases from the data captured 
by IoT-based systems [21]. AI techniques for processing big data can help 
to improve patients’ wellbeing worldwide [15, 22]. IoT technologies allow 
the global costs of chronic disease prevention to be reduced. The continuous 
health data gathered by these systems can be used to help patients during 
self-administered treatments. Mobile phones with mobile applications are 
frequently used and integrated with telemedicine and mHealth through the 
IoMT [23]. Deep learning (DL) approaches are used in health-based 
applications to achieve promising performance on sizeable amounts of 
data. DL and cognitive algorithms have seen considerable progress recently 
and, consequently, have been used to solve many complex problems. 

The large volumes of data captured by IoT-based systems can be processed by 
DL models, since the data can be stored in a cloud database for further 
processing [11]. These methods can be used to process the data and gain 
insight that can be used in the treatment of patients suffering from 
any form of disease. DL outperforms traditional models for diagnosis 
and decision-making. This is significant in a predefined task, where 
the rules are learned from real data. DL uses the limited information 
structure to make intelligent decisions in smart healthcare systems, from 
which doctors can draw conclusions. Big data is central to the superior 
performance of DL techniques [11]. Wearable body sensors (WBS) that 
gather large volumes of data can be combined with AI models that process 
such data to support the treatment of patients in real time. The use of AI 
techniques helps to obtain intelligent results from the captured data and 
is useful for its interpretation. Hence, it helps specialists obtain intelligent 
responses from the data for clinical use, saving lives 
and time [11]. 

Therefore, this chapter presents an innovative IoT-based enhanced CNN 
for patient disease diagnosis. The CNN is used to diagnose disease from data 
captured by IoT-based sensors. A practical case of heart disease is 
used to test the performance of the proposed system. 


The contributions of the chapter are stated as follows: 


(i) The applicability of IoT-enabled CNNs in clinical systems is presented, 
together with the challenges and future prospects of IoT-enabled 
CNNs. 

(ii) An intelligent IoT-enabled CNN model for a disease diagnosis 
framework is proposed. 

(iii) The performance of the proposed method is presented using a practical 
case in a healthcare system. 


12.2 Internet of Things Application in the Healthcare 
Systems 


The IoT-based framework is a new breakthrough that is expected to lead to the 
discovery of novel pharmaceuticals and clinical treatments. It offers 
characteristics with high potential for the productivity and quality of 
healthcare, such as adaptability, flexibility, fondness, cost reduction, and 
speed. This innovation also helps us to understand the particular risks related 
to security and privacy. IoT primarily aims to connect the world by linking 
various devices and sensors, which are interconnected by means of the 
Internet. In the healthcare framework, IoT is predominantly used to access 
data in real time, and it is fundamentally built to access very large volumes 
of data. It denotes the ability of a matrix of PCs to transmit programs and 
data. This innovation can generally be delivered by various servers as retail 
organizations, whose necessities can be totally met by combined use. 

Here, an ontology-based emergency medical service framework provides a way 
of gathering, integrating, and interoperating IoT data. Based on the 
ontology construction in the emergency medical service, decisions can be 
made in IoT through its dynamic interaction, and a decision support system 
(DSS) can be used [24, 25]. Through ontology development, decisions 
for the medical services system can be made in real time and with ease. In 
healthcare administration, specialists, patients, and doctors play significant 
roles and are involved in the whole service. Specialists need to access the 
patient record from anywhere and in real time, which is possible by storing 
it in a distributed manner [26]. Patients also need to know about the 
specialists’ availability and the status of the equipment (occupied/free) [27]. 
To help patients access the availability status of specialists, an IoT resource 
model is required. 


12.2.1 Internet of Things Operation in Healthcare Systems 


IoT-based systems play a critical role in medical services 
for taking care of patients in real time [13, 28]. A computer-aided 
approach to gathering diagnostic data reduces the risk of human error in 
data gathering [29]. This increases diagnosis quality and lessens the 
chance of human errors, such as the gathering or transmission of erroneous 
information, which is harmful to individuals’ wellbeing [13]. Telemedicine 
has recently been widely applied in remote identification, prediction, 
surveillance, recovery, and treatment, where IoT technology plays a 
significant role. With growing interest in designing smart technologies, 
for example, healthcare tracking systems, clinical diagnosis, prediction and 
treatment systems, and smart healthcare, IoT has recently been 
implemented in the clinical domain. Data obtained from clinical devices, such 
as wearable sensors and CT and MRI machines, have been used to improve 
telemedicine and smart healthcare in general. 

Thanks to IoT-based technology, sensors and devices can communicate within 
a smart environment, and knowledge can be easily exchanged across 
healthcare networks. However, although these new fields are emerging 
rapidly, they also have their difficulties, especially when the target is 
healthcare systems, where it is difficult to provide energy-efficient, 
secure, adaptable, suitable, and reliable solutions. IoT is expected 
to reach a market size of $300 billion in the medical services sector by 
2020, comprising clinical devices, infrastructure, services, and solutions 
[30]. This demand for personalized e-healthcare is also likely to be 
promoted by government initiatives. Numerous data centers and resources 
are required worldwide to store big data, and this has become a challenge. 
It is a significant problem to extract meaningful patterns from big data, such 
as patient diagnostic data. 

Nowadays, a variety of emerging applications for various conditions are being
developed. Such systems and sensors are most commonly used for current or
near-future applications [31]. Recently, various studies have developed
wearable devices for monitoring patients’ physiological parameters,
especially in remote areas [32, 33]. For example, these devices have been
used to monitor health indicators such as body temperature, blood sugar, and
blood pressure for patients’ wellbeing. The systems have also been used to
monitor elderly patients’ daily activities and eating habits, such as
working and sleeping. A cloud data warehouse is used to store the big data
gathered from the IoT-based devices and sensors, which can then be processed
to predict disease in the patient’s body [34]. Patients in hospitals,
particularly those on life support, require close surveillance and
attentiveness so that potential crises can be responded to and lives
preserved.

In these activities, sensors are used to capture biomedical signals, which
are then evaluated and stored in the cloud before being passed to online
caregivers [13]. By analyzing the flow of data collected by the sensors, a
group of healthcare professionals interact and then assess patients according
to their specialties. For high-risk individuals, determining the emergency
condition (patients requiring urgent or emergency operations, cardiovascular
problems, and so on) then becomes simple.
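
The flow described above, in which sensor readings are evaluated before they
reach caregivers and high-risk cases are singled out, can be sketched as
follows. This is only an illustration under stated assumptions: the
`NORMAL_RANGES` limits, the field names, and the `flag_reading` and `triage`
helpers are hypothetical and are not taken from the chapter, and the ranges
are not clinical guidance.

```python
# Illustrative ranges only (assumed for this sketch, not clinical guidance).
NORMAL_RANGES = {
    "body_temp_c": (36.0, 37.5),
    "blood_sugar_mg_dl": (70.0, 140.0),
    "systolic_bp_mmhg": (90.0, 130.0),
}


def flag_reading(reading):
    """Return the names of the vital signs that fall outside their range."""
    out_of_range = []
    for name, (low, high) in NORMAL_RANGES.items():
        value = reading.get(name)
        if value is not None and not (low <= value <= high):
            out_of_range.append(name)
    return out_of_range


def triage(readings):
    """Order patients so that those with the most abnormal signs come first."""
    flagged = [(patient, flag_reading(r)) for patient, r in readings.items()]
    flagged = [(p, signs) for p, signs in flagged if signs]
    return sorted(flagged, key=lambda item: len(item[1]), reverse=True)


readings = {
    "patient_a": {"body_temp_c": 36.8, "blood_sugar_mg_dl": 95.0,
                  "systolic_bp_mmhg": 118.0},
    "patient_b": {"body_temp_c": 39.2, "blood_sugar_mg_dl": 210.0,
                  "systolic_bp_mmhg": 125.0},
    "patient_c": {"body_temp_c": 37.0, "blood_sugar_mg_dl": 100.0,
                  "systolic_bp_mmhg": 160.0},
}
urgent = triage(readings)
```

Here `urgent` lists only the patients with at least one out-of-range sign,
so a caregiver would see `patient_b` before `patient_c`, and `patient_a`
would not appear at all.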

In healthcare IoT systems, context awareness is a key criterion. Because it
can track the patient’s condition and the environment in which the patient is
located, it greatly helps healthcare professionals to understand the
variations that can affect the health status of these patients. Moreover, a
change in the physical condition of the patient may increase his/her
vulnerability to illnesses and be a cause of deteriorating health
[11, 13, 14]. Using several kinds of dedicated sensors that capture different
data about the patient’s physical state (walking, running, sleeping, and so
on) or the environment the patient is in (wet, cold, warm, and so forth), and
having them collaborate to gather the relevant data, gives a better
understanding of the patient’s condition, whether they are hospitalized, at
home, or anywhere else. It also helps in emergency cases to locate the
patients and to know what sort of emergency intervention can be taken.
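
One way to picture context awareness is a record that carries the activity
and the environment alongside the sensor value, so that the same reading can
be judged differently in different situations. The sketch below is an
assumption-laden illustration: the `ContextReading` class, the
`EXPECTED_HEART_RATE` table, and its ranges are hypothetical and do not come
from the chapter.

```python
from dataclasses import dataclass

# Assumed heart-rate ranges (beats per minute) per activity; illustrative only.
EXPECTED_HEART_RATE = {
    "sleeping": (40, 70),
    "walking": (70, 120),
    "running": (100, 170),
}


@dataclass
class ContextReading:
    """One sensor sample combined with the patient's activity and surroundings."""
    heart_rate_bpm: int
    activity: str      # e.g. "sleeping", "walking", "running"
    environment: str   # e.g. "wet", "cold", "warm"
    location: str      # e.g. "hospital", "home"


def is_unusual(reading):
    """Judge the heart rate against the range expected for the current activity."""
    low, high = EXPECTED_HEART_RATE[reading.activity]
    return not (low <= reading.heart_rate_bpm <= high)


running = ContextReading(130, "running", "warm", "home")
sleeping = ContextReading(130, "sleeping", "warm", "home")
```

With these assumed ranges, 130 beats per minute is treated as expected while
running but as unusual while sleeping, which is the kind of distinction that
context information makes possible.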

In terms of performance, integrated cloud and IoT-based applications
outperform traditional cloud-based apps. Military, banking, and medical
systems can all benefit from a cloud-based IoT system. In particular, the
cloud-based IoT strategy would help provide medical applications with
efficient resources for monitoring and retrieving data from any remote place.
Using the data-centric embedded system, data can be collected and updated in
real time over a predetermined period. This can be used to monitor patients
in real time, especially elderly patients in remote areas; in most cases it
supports disease diagnosis and treatment before serious health problems
develop.

Because smart devices generate massive amounts of data, DL models can be used
in big data processing to make intelligent decisions. Applying data
processing procedures to a given sector requires characterizing the data
involved, such as its velocity, variety, and volume. A standard data
analytics architecture entails the creation of a neural network model and a
clustering process model, together with the deployment of other
sophisticated methods. IoT devices can produce different kinds of data from
several sources, so the characteristics of the generated data must be
described for appropriate data handling. This helps in handling the different
characteristics of the captured data for scalability and speed, and therefore
helps in finding the models that give the best results in real time. These
are well known to be among the IoT’s major problems. In recent developments,
however, they are also issues that open up an enormous number of
possibilities. Such data can be accessed using the latest healthcare
applications, and the information is stored securely on the cloud server.


12.2.2 Internet of Things and Patient Monitoring Systems 


For many real-world applications, monitoring systems are an important
component. Many people’s health is at risk today all across the world because
of a lack of adequate healthcare monitoring [35]. The elderly, children, and
chronically ill people need to be checked almost every day. A remote
monitoring system helps avoid unnecessary hospital visits, as shown in
Figure 12.1. Because of the critical condition of these patients, health
problems can frequently go unrecognized until diseases progress to a crisis
point. Remote access sensors allow caregivers to perform pre-diagnosis and
to intervene before problems arise.

Figure 12.1 Real-time remote monitoring system.


As a result, people-centric IoT will be employed for people with various
neurocognitive disabilities, allowing them to live more autonomously and with
greater ease [36]. When a sensor is attached to the skin at specific
locations, it can be used to diagnose the condition of the heart and the
effect of medication on its activity. Many patients who suffer from chronic
diseases such as cardiopulmonary disease, asthma, and heart failure live far
from medical care facilities. Real-time monitoring of such patients through
remote monitoring systems is the most promising application. Examples of
real-time healthcare monitoring systems are remote patient tracking and
monitoring systems, remote monitoring of cardiac patients, and heartbeat
monitoring systems. A real-time monitoring system consists of a remote
medical monitoring unit and a monitoring centre. It examines the sensor data
using real-time analysis, and a warning signal is raised for emergencies and
diagnosis. The signals from the body sensors are carried to the corresponding
medical centre through a wireless local area network (WLAN). This real-time
monitoring system therefore provides information about patients’ conditions,
and it may also reduce complications and allow treatment to be given
promptly. Consequently, it provides accurate and continuous monitoring in the
healthcare sector. It also supports faster detection from the input sensors
and can save lives.
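
The behaviour of the monitoring unit, which examines incoming sensor data and
raises a warning signal, can be illustrated with a small rolling-window
check. This is a minimal sketch, not the system described in the cited work:
the `HeartRateMonitor` class, the 120 bpm limit, and the three-reading window
are assumptions chosen for the example.

```python
from collections import deque


class HeartRateMonitor:
    """Keeps the most recent readings and raises a warning on sustained highs."""

    def __init__(self, limit_bpm=120, window=3):
        self.limit_bpm = limit_bpm          # assumed alert limit
        self.recent = deque(maxlen=window)  # last `window` readings only

    def add(self, bpm):
        """Record a reading; return True when a warning should be sent."""
        self.recent.append(bpm)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(r > self.limit_bpm for r in self.recent)


monitor = HeartRateMonitor()
stream = [88, 125, 90, 123, 127, 131, 95]
warnings = [monitor.add(bpm) for bpm in stream]
```

Requiring several consecutive high readings, rather than a single one, is a
design choice that avoids sending a warning for an isolated spike; in this
stream only the sixth reading triggers a warning.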

Because of a lack of access to good health monitoring, many health
conditions may be misdiagnosed, a problem that exists all across the world.
The IoT, on the other hand, enables small, efficient wireless technologies to
bring monitoring to patients rather than the other way around. These
solutions make the secure recording of patient health information possible.
Before being exchanged via computer networks, the data is evaluated using a
network of equipment and complex computations. Medical specialists can then
provide appropriate health advice via the Internet.


12.2.3 Internet of Things and Healthcare Data Classification 


IoT-based systems, with well-organized devices and sensors, can work better
than traditional network methods. IoT has played an important role in
society since its inception, spanning everything from traditional equipment
to ordinary household objects [37], and has recently attracted serious
attention in fields such as transportation, agriculture, education, and, most
especially, healthcare. The systems can be used remotely to monitor and check
elderly patients without their seeing a doctor or specialist before treatment
is given for an illness. The devices and sensors can easily communicate with
each other over the Internet and other wireless connections. These devices
can even be used to make decisions with the assistance of doctors or
specialists. To make IoT smarter, a number of analysis technologies have been
incorporated, among the most significant of which is data mining.

Data mining involves discovering novel, interesting, and potentially useful
patterns in large data sets and applying algorithms to extract hidden
information. Various terms are used for data mining, for example, knowledge
discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archaeology, data dredging, and information harvesting [38].
The objective of any data mining process is to build efficient predictive or
descriptive algorithms that best explain the big data captured, support
proper decision-making, and, where possible, summarize the data into a
meaningful form [39]. Taking a broad view of its functionality, data mining
is the process of discovering interesting knowledge from large amounts of
data stored in databases, data warehouses, or other information repositories.

Realizing the full potential of data mining in healthcare systems involves
three stages. The following steps make up the data mining process.


(i) Data preparation: The removal of noise from data, synchronization of
data captured from various sources, and the extraction of meaningful
information from the captured data are the three main sub-steps in
data preparation. The data has to be organized in such a way that it is
meaningful for the users and easier to process.

(ii) Data mining: The application of computational methods to the data to
discover and evaluate the classes of the collected data.

(iii) Interpretation: The results gathered from the mined data are turned
into meaningful decision-making information for clients and users.
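
The three steps above can be sketched as three small functions applied in
sequence. This is a minimal illustration rather than a method from the
chapter: the function names, the two hypothetical temperature sensors, the
averaging used as the extracted feature, and the 37.5 threshold are all
assumptions made for the example.

```python
from statistics import mean


def prepare(raw):
    """Step (i): drop missing samples, merge sources by timestamp, extract a feature."""
    merged = {}
    for source in raw:
        for timestamp, value in source:
            if value is not None:               # remove unusable samples
                merged.setdefault(timestamp, []).append(value)
    # One averaged value per timestamp, in time order.
    return [mean(values) for _, values in sorted(merged.items())]


def mine(features, threshold):
    """Step (ii): assign each feature value to a class."""
    return ["high" if f > threshold else "normal" for f in features]


def interpret(labels):
    """Step (iii): summarize the mined classes for a decision-maker."""
    share = labels.count("high") / len(labels)
    return f"{share:.0%} of readings were high"


# Two hypothetical temperature sensors reporting (timestamp, value) pairs.
sensor_a = [(1, 36.6), (2, None), (3, 38.4)]
sensor_b = [(1, 36.8), (2, 37.0), (3, 38.6)]
features = prepare([sensor_a, sensor_b])
labels = mine(features, threshold=37.5)
summary = interpret(labels)
```

In a real system each step would be far richer (noise filtering, proper time
alignment, a trained model), but the division of work between preparation,
mining, and interpretation stays the same.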


Classification is important for decision-making. In data mining,
classification means assigning each item to one of a set of predefined,
meaningful classes. The main objective of classification in data mining is to
predict accurately the class of each case in the data [40]. For instance, a
classification model can use the classes low, medium, or high credit risk to
categorize loan applicants, and similar models are used in various fields
[41].
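
To make the credit-risk example concrete, the sketch below assigns a new case
to one of three predefined classes using a nearest-centroid rule. The rule is
chosen here only because it is short; it is not claimed to be the method used
in [40] or [41], and the training cases, the two features, and the function
names are hypothetical.

```python
# Hypothetical training cases: (debt-to-income ratio, missed payments) -> class.
TRAINING = [
    ((0.10, 0), "low"), ((0.15, 1), "low"),
    ((0.35, 2), "medium"), ((0.40, 3), "medium"),
    ((0.70, 6), "high"), ((0.80, 7), "high"),
]


def centroids(cases):
    """Average the feature vectors of each class."""
    sums = {}
    for (x, y), label in cases:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def classify(case, class_centroids):
    """Assign the case to the class whose centroid is closest."""
    def squared_distance(label):
        cx, cy = class_centroids[label]
        return (case[0] - cx) ** 2 + (case[1] - cy) ** 2
    return min(class_centroids, key=squared_distance)


model = centroids(TRAINING)
prediction = classify((0.38, 2), model)
```

Because the two features are on different scales, the distance here is
dominated by the number of missed payments; in practice the features would
normally be rescaled first.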

Various data mining techniques have been used to process big data in
healthcare systems, including K-means, rule-based mining, clustering,
hierarchical clustering, neural networks, Bayesian networks, and support
vector machines.

The most important component of IoT applications is an effective data
analytics method capable of performing tasks such as classification,
clustering, regression, and so on. In IoT applications, DL is commonly
employed for data analytics. DL and the IoT were named as two of the top
three strategic technology trends for 2017 [42]. The reason for this zealous
promotion of deep learning is the inability of traditional ML approaches to
meet the expanding analytical needs of IoT systems. Thankfully, advances in
ML paradigms are making the desired data analytics possible in IoT
applications. Image recognition, information retrieval, speech recognition,
natural language processing, indoor localization, physiological and
psychological state detection, and so on have all shown significant results
using DL models, and these are the foundation services for IoT applications
[43, 44].

DL has paved the way for major breakthroughs in healthcare through novel
architectures such as the hierarchical computing architecture (HiCH);
combined with convolutional neural networks (CNNs), this allows IoT devices
to work around the limitations of wireless body area networks (WBANs) [45].
ML classifiers and related techniques that address missing values and
decision tree generation, such as C4.5, C5.0, KNN, and EM, make the working
module/architecture significantly more efficient [46, 47]. There are also
several meta approaches that augment ML techniques for improved performance
[48].


12.3 Application of Internet of Things Enabled 
Convolutional Neural Network in Healthcare Systems 


Because of its high-level and pervasive monitoring, several modern
technologies, such as the Internet of Things, are becoming more accessible
these days. The IoT also allows a systematic and competent approach to
patient healthcare based on remote patient monitoring and mobile health.
Moreover, DL models are used in health-related applications to produce
promising and satisfactory results on large amounts of data. There are
interruptions in data transmission to the cloud when tracking the patients’
health status. Because it has shown outstanding results in various
industries, DL will play a critical role in building a smarter IoT. Image
recognition, information retrieval, speech recognition, natural language
processing, indoor localization, psychophysiological state detection, and so
on are all examples of such applications, and these are the services that
IoT applications depend on. Understanding the potential of DL for IoT
predictive analytics becomes crucial at this stage, because DL models are
well suited to handling the highly complex data produced by IoT devices.
Some deep learning models have been proposed to fuse large datasets sensed
from multiple modalities. DL models have also been developed for extracting
temporal relationships from IoT data. DL models are more powerful in certain
fields, and restricted Boltzmann machines (RBMs) offer many possibilities for
feature extraction and classification.


12.3.1 IoT-Enabled CNN for Improving Healthcare Diagnosis


The sophisticated IoT, with its unlimited networking possibilities for
biomedical data analysis, is strengthening the interaction between technology
and the healthcare community. In the last few years, the field has been
rapidly transformed by deep neural networks (DNNs) and the quick public
acceptance of medical wearables. IoT enabled by DNN has allowed for
revolutionary medical breakthroughs and unique possibilities in medical data
processing in the healthcare industry [49]. Despite this development, there
are still a number of concerns to be resolved in terms of service quality.
Applying deep networks so as to deliver high quality in the key aspects of
end-to-end response time, overhead, and accuracy is the key to succeeding in
the move from client-oriented to patient-oriented medical data analysis for
the healthcare community [50].

Deep learning techniques such as CNNs, long short-term memory (LSTM),
auto-encoders, deep generative models, and deep belief networks have already
been applied to analyze potentially very large collections of data
efficiently [50]. Applying these techniques to clinical signals and images
can help clinicians in clinical decision-making [51]. Because of their higher
time and network bandwidth requirements, these home automation sensors for
smart medical systems are still underdeveloped in clinical support
infrastructure.

In [45], the authors define DL as a subset of machine learning techniques
that have recently been used in a variety of domains. It has been shown to
outperform traditional methods in speech recognition and visual object
detection. The multiple processing layers in deep learning models can learn
important features from raw input without the need for domain-level
expertise. Conventional ML models, on the other hand, typically require a
significant amount of domain expertise to extract features before performing
classification. CNN is a form of deep neural network (DNN) that is commonly
employed with two-dimensional inputs such as images and video frames. Using
millions of images as inputs, they can learn hundreds of thousands of items.

The learning capacity of a CNN can be adjusted by changing the breadth and
depth of the model. Besides two-dimensional signals, CNNs are used with
one-dimensional signals such as electrocardiography (ECG) or audio signals.
The typical architecture of a CNN used for image recognition is built by
stacking several layers of processing units with different roles. The key
unit of the CNN architecture is its convolutional layer, which contains
learnable filter banks that are activated when a specific feature is
detected. A max-pooling layer is used in the CNN architecture to reduce the
number of parameters and to control overfitting [50]. Fully connected layers
generally follow a series of convolutional and max-pooling layers.
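
The roles of the convolutional and max-pooling layers can be seen on a toy
one-dimensional signal. The sketch below is an illustration under simple
assumptions: the ECG-like trace is made up, the filter weights are fixed by
hand (in a CNN they are learned), and the sliding product is computed without
flipping the kernel, as is usual in deep learning libraries.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (valid positions only, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Keep positive responses and zero out the rest."""
    return [max(0, v) for v in values]


def max_pool1d(values, size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Toy ECG-like trace with one sharp peak; the kernel responds to rising edges.
ecg = [0, 0, 1, 5, 1, 0, 0, 0]
rising_edge = [-1, 1]   # in a CNN these weights are learned, here they are fixed
feature_map = relu(conv1d(ecg, rising_edge))
pooled = max_pool1d(feature_map)
```

The filter responds most strongly where the trace rises towards its peak, and
pooling keeps that response while halving the number of values passed to the
next layer, which is how pooling keeps later layers small.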

The function of these layers is to act as a classifier for the learned
features. The authors in [45] also used a health case study on ECG
classification to validate a proposed architecture driven by CNN. In this
case, decision-making is carried out at the edge, so the user receives
notifications as soon as an illness is detected. The response time and high
availability are compared with a traditional IoT-based system in which the
cloud server performs all of the computations, after which the accuracy of
HiCH is determined. This enables decision-making at the initial stage of
monitoring and throughout the monitoring process.


12.3.2 IoT-Enabled CNN for Improving Healthcare Monitoring
System 


Wearable sensors and social networking platforms play a vital role in
providing another way to collect patient data for effective healthcare
monitoring. However, continuous patient monitoring using wearable sensors
generates a large amount of healthcare data. Moreover, the user-generated
healthcare data on social networking sites comes in huge volumes and is
unstructured. Existing healthcare monitoring systems are not effective at
extracting valuable information from sensor and social networking data, and
they have difficulty analyzing it adequately. Also, traditional ML approaches
are not sufficient to handle healthcare big data for abnormality prediction.
Accordingly, a novel healthcare monitoring system based on the IoT
environment and a big data analytics engine using DL models is needed to
store and analyze healthcare data precisely and to improve classification
accuracy.

Artificial intelligence (AI) allows machines to mimic human behavior, and ML
algorithms are a subset of AI. ML includes training and learning components
in order to classify a given dataset. ML, however, cannot perform effectively
as the dataset grows in size and heterogeneity. As a result, contemporary
data analytics research has concentrated on DL approaches, which continuously
and reliably learn and classify massive volumes of data. Because of their
effectiveness, CNNs, auto-encoders, and their combinations are the models
most often used for monitoring among the DL frameworks. Surveillance
monitoring is becoming more important to ensure security in all types of
buildings, from tiny businesses to major corporations. The IoT is currently
used to facilitate faster and easier connectivity, because the majority of
enterprises turn to the Internet to receive and transfer data. In addition,
IoT devices deployed in remote sensing applications create a large amount of
data. A Raspberry Pi, a CCTV camera, mobile devices, and other sensors are
all part of the IoT circuit.

The monitoring system includes recording activities in order to watch for
unusual activities and alert people. Public places such as shopping centres,
airports, and other places where large numbers of people gather are
monitored, which helps in protecting the public. Observing people who are
misbehaving is called targeted monitoring. Some of the existing monitoring
devices are smoke detectors, door counters, and subway passenger counters.
In certain places, surveillance monitoring uses electronic cameras,
electronic listening devices, and building access cards to prevent
misconduct and improper use of the premises. In addition, various real-time
applications use a surveillance monitoring system. Among them are the
following: (i) healthcare monitoring and controlling system; (ii) airport
surveillance system; (iii) building and industry system; (iv) remote
healthcare system; and (v) forest tracking system.

Recently, the use of social networking in the healthcare industry has been
increasing rapidly. Social network data can also be used to identify factors
such as emotional status and accumulated stress, which can be translated into
a patient’s health status. People with diabetes and abnormal blood pressure
(BP) share their feelings and experiences with others on social networking
sites [52]. They share valuable information and motivate each other to fight
against diabetes and high/low BP. Likewise, diabetes patients publish their
opinions about specific drugs [53]. Other patients read these opinions and
respond to them about the same drugs. Therefore, healthcare monitoring
systems for diabetes and abnormal BP need social networking data to identify
emotional disturbances in patients from their posts and to monitor drug side
effects from drug reviews. However, the data on social networking sites about
patient emotions and drug experiences is unstructured and unpredictable, and
it would be a difficult task for healthcare monitoring systems to extract the
data and analyze it to monitor the patient’s mental health and to predict
drug side effects. There is therefore a need for smart approaches that can
automatically extract the most important features and reduce the
dimensionality of the datasets for better accuracy of the healthcare
monitoring system.

Both healthcare and social networking data have grown significantly within a
few years into what is called big data (both unstructured and structured)
[54]. Traditional approaches and ML techniques may not handle these data well
enough to extract important information and predict abnormalities. Moreover,
these data may not help the healthcare industry unless they are processed
intelligently in real time. This requires a big data cloud platform and an
advanced deep learning approach, for example, a CNN model [55].


12.4 The Challenges of Internet of Things Enabled 
Convolutional Neural Network in Healthcare Systems 


Because IoT is now also employed in real-world applications, QoS is directly
tied to data quality, which can be used in decision support applications. The
data obtained from healthcare sensors should be collected, transferred,
processed, analyzed, and used on time; sometimes, however, IoT devices cannot
deliver the required data at the appropriate time. This poses several
challenges to the quality of IoT healthcare systems. Since medical wearable
systems work with time-critical applications, they require strict QoS
guarantees. This is an exciting open issue in the field of heterogeneous data
collection, real-time patient monitoring, and support for automated
decision-making with respect to QoS. Overcoming these difficulties with
respect to QoS is therefore essential.

Today, the costs of healthcare and treatment devices appear to be more
important than ever. To the best of the authors’ knowledge, no comparative
study has analyzed the costs of IoT healthcare systems. As a result, we
believe cost analysis is an open question in IoT healthcare systems. Even in
wealthy countries, the high cost of testing equipment in IoT healthcare
systems is a serious challenge. The IoT has not yet made treatment services
more cost-effective for the general public. The rise in the cost of
healthcare devices is a cause for concern for everyone [56].

Since a large volume of unstructured data is generated, understanding it is
extremely complex. The applications continuously collect data related to the
patient’s health, so different storage mechanisms are required compared with
typical databases. Accordingly, one of the challenges in this field is data
storage and management. The data is large in volume and makes up a complex,
diverse, and rich body of healthcare information. Uncertainty in these data
is very high. Improving and verifying the validity of health-related data and
obtaining useful information from them is therefore difficult. Hence,
research on analyzing these big health data in a healthcare environment will
be needed for making better clinical decisions and adopting the required
strategies.

Healthcare application developers face several challenges, such as security,
privacy, and database technology; data privacy and security are important
concerns in IoT-based healthcare applications [57, 58]. Security can be
defined as authentication management and the setting of access rules for the
patient’s applications and data. Data security in IoT-based healthcare is of
high importance. Healthcare applications are designed and developed by
obtaining data from IoT devices. The large amounts of data that are
continuously transferred and stored can be hacked and misused, affecting both
the patient and the doctor. Hackers can create fake identities to buy drugs
in order to abuse them. Security, privacy, and the prevention of data leakage
are the main concerns in maintaining the patient’s or user’s data, ensuring
that data is shared with others only with the patient’s consent [56, 58].


12.5 Framework for Internet of Things Enabled CNN in 
Healthcare Systems 


The proposed model has four layers: the wearable device layer, the gateway,
the IoT-cloud layer with the CNN processing model, and the user connectivity
layer, called the application layer. The captured data is processed using the
CNN model, and the result is used to monitor cancer patients and predict
their condition; the specialists and caregivers can then manage treatment
using the related information on the dedicated server of the IoT-cloud
database. The data captured from the patients is sent to the cloud database
through the Wi-Fi gateway so that the collected data can be processed in the
cloud network. The CNN-based model is used to detect people with cancer in
order to produce more precise computation and better accuracy. The modeled
system contains wearable devices connected to IoT cloud sensor nodes. Cancer
patients are detected from the data captured by the various devices and
stored in the cloud database.

The data is sent to the data model, which applies the CNN to detect and
monitor cancer patients using the captured data stored in the cloud database,
by collecting the stored physiological signs and performing the task with the
model. The user application layer is made up of users (people with and
without cancer) in a smart environment. The data server sends the information
for clinical assessment whenever an emergency alarm is recognized. A signal
processing unit with IoT-based assistance is activated in the data model to
control parallel computation, and a cloud base gives permanent access to the
acquired data. A time-sequential analysis of the collected sampled data is
performed in the parallel processing, and the data is therefore assumed to be
sampled at 125 Hz. Figure 12.2 displays the proposed system framework.


Figure 12.2 The architecture of the proposed system.
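
A time-sequential analysis of a signal sampled at 125 Hz usually starts by
cutting the stream into fixed-length windows that can be handed to the model
one at a time. The sketch below shows that step only; the two-second window
length and the `split_into_windows` helper are assumptions made for
illustration and are not specified in the chapter.

```python
SAMPLING_RATE_HZ = 125   # sampling rate stated in the text
WINDOW_SECONDS = 2       # assumed window length for this sketch


def split_into_windows(samples, rate_hz=SAMPLING_RATE_HZ, seconds=WINDOW_SECONDS):
    """Cut a sampled signal into consecutive, non-overlapping windows."""
    size = rate_hz * seconds
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


# Five seconds of placeholder samples standing in for a wearable's output.
signal = [0.0] * (SAMPLING_RATE_HZ * 5)
windows = split_into_windows(signal)
```

Five seconds of data at 125 Hz gives 625 samples, which yields two complete
windows of 250 samples; the remaining 125 samples are held back until the
next window is full.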


12.5.1 The Practical Application of the Proposed Framework 


12.5.1.1 Dataset 

The “Wisconsin Diagnostic Breast Cancer (WDBC)” dataset was created by
Dr. William Wolberg at the University of Wisconsin and is available from the
UCI Machine Learning Repository [59]. It was used as the dataset for
implementing the proposed study, which designs an AI-based system for the
diagnosis of breast cancer. The dataset contains 569 subjects with 32
attributes, 30 of which are real-valued features. The target output label,
diagnosis, has two classes representing malignant or benign subjects. The
class distribution is 357 benign and 212 malignant subjects. The dataset is
therefore a 569 x 32 feature matrix.
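
The figures quoted for the dataset can be checked with a few lines of
arithmetic. The sketch below assumes the usual UCI layout, in which the 32
attributes are an ID, the diagnosis label, and the 30 real-valued features;
the variable names are illustrative.

```python
N_SUBJECTS = 569
N_ATTRIBUTES = 32        # assumed layout: ID + diagnosis label + 30 features
N_FEATURES = 30
CLASS_COUNTS = {"benign": 357, "malignant": 212}

# The two classes together account for every subject.
assert sum(CLASS_COUNTS.values()) == N_SUBJECTS

# Share of each class, useful when judging a classifier's accuracy.
class_share = {label: count / N_SUBJECTS for label, count in CLASS_COUNTS.items()}
majority_baseline = max(class_share.values())
```

Because about 63% of the subjects are benign, a classifier that always
answers “benign” would already reach that accuracy, so reported accuracies
should be read against this baseline.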


12.5.1.2 Dataset pre-processing 
Before machine learning algorithms are applied to classification problems,
data processing is necessary. Processed data [60] reduces the computation
time of the classifier and improves its classification performance.
Techniques such as missing value detection, the standard scaler, and the
min-max scaler are widely applied in dataset pre-processing. The standard
scaler ensures that each feature has mean 0 and variance 1, so that all
features are on the same scale. The min-max scaler shifts the data so that
all features lie in the range between 0 and 1. Any feature that has an empty
value in the column is removed from the dataset.
Y Y — min (Y) 
max (Y) — min (Y) 


(12.1) 
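
The two scaling steps can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the chapter's implementation; the function names, the sample column, and the handling of a constant column are assumptions.

```python
import math
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale a feature column to [0, 1] following eqn (12.1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero, so map to 0.0.
        # (This fallback is an assumption, not part of the chapter.)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values: List[float]) -> List[float]:
    """Shift and rescale a feature column to mean 0 and variance 1."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # population variance
    if var == 0.0:
        return [0.0 for _ in values]
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0, 8.0]  # made-up feature values
print(min_max_scale(column))
print(standard_scale(column))
```

In practice the minimum, maximum, mean, and variance would be estimated on the training split only and then reused for the test split, so that no information leaks from the test data.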


12.5.2 Convolutional Neural Network Classification 


A CNN model for classifying the cancer dataset is introduced in this section 
to assist in disease diagnosis. The architecture of the proposed CNN model 
consisted mainly of three convolutional layers and a fully connected layer. 
Each convolutional layer was followed by a batch normalization (BN) layer that 
normalized the outputs of the convolutional layer, a rectified linear unit 
(ReLU) layer that applied an activation function to its input values, and a 
max pooling layer that performed a down-sampling operation. The proposed CNN 
adopted an average pooling layer before the fully connected layer to reduce 
the dimensions of the feature values input to the fully connected layer. 
Following the work in [61], a dropout rate of 0.5 was used between the average 
pooling layer and the fully connected layer to avoid overfitting and improve 
performance. Spatial dropout between each max pooling layer and the 
convolutional layer that follows it was also tried, but such dropouts were 
found to degrade performance [62]; thus, the model did not apply spatial 
dropout. As input, the network takes the feature values of a cancer dataset, 
and it outputs the probability that the sample belongs to a particular class 
(e.g., the probability that the corresponding patient has breast cancer). The 
instances were fed into the model layer by layer. The input to each layer is 
the output of the previous layer. The layers perform specific transformations 
on the input values and then pass the processed values to the next layer. 
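
The layer ordering described above can be made concrete by tracing how the shape of one sample changes as it passes through the network. This is a sketch under stated assumptions: the chapter does not give kernel sizes, padding, pool sizes, or channel widths, so the values below (1-D convolutions over the 30 features, kernel 3, padding 1, pool size 2, and widths 16/32/64) are hypothetical; only the layer order and the dropout rate of 0.5 come from the text.

```python
from typing import List, Tuple

# Layer order as described in the text. Kernel size, padding, pool size, and
# channel widths are NOT given in the chapter; the values here are assumptions.
CONV_CHANNELS = [16, 32, 64]   # hypothetical widths of the three conv layers
KERNEL, PAD, POOL = 3, 1, 2    # hypothetical 1-D kernel, padding, pool size
DROPOUT_RATE = 0.5             # stated in the text
N_CLASSES = 2                  # malignant vs. benign


def conv_out(length: int) -> int:
    """Output length of a stride-1 1-D convolution."""
    return length + 2 * PAD - KERNEL + 1


def pool_out(length: int) -> int:
    """Output length of non-overlapping max pooling (floor division)."""
    return length // POOL


def trace_shapes(n_features: int) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (layer name, output shape) pairs; shapes are (channels, length)."""
    shapes: List[Tuple[str, Tuple[int, ...]]] = []
    length = n_features
    for i, ch in enumerate(CONV_CHANNELS, start=1):
        length = conv_out(length)
        shapes.append((f"conv{i}", (ch, length)))
        shapes.append((f"bn{i}", (ch, length)))      # BN keeps the shape
        shapes.append((f"relu{i}", (ch, length)))    # ReLU keeps the shape
        length = pool_out(length)
        shapes.append((f"maxpool{i}", (ch, length)))
    last = CONV_CHANNELS[-1]
    shapes.append(("avgpool", (last,)))              # average over the length axis
    shapes.append(("dropout", (last,)))              # rate DROPOUT_RATE, shape kept
    shapes.append(("fc", (N_CLASSES,)))
    return shapes


for name, shape in trace_shapes(30):
    print(f"{name:10s} -> {shape}")
```

With these assumed settings the feature axis shrinks from 30 to 15, 7, and 3 across the three max pooling layers, and the average pooling layer then collapses it, so the fully connected layer receives one value per channel.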

The deep learning model employed in this chapter is a CNN for classifying the 
various physiological symptoms and signs of persons collected using wearable 
devices and sensors [63]. To obtain the initial feature values, the dataset is 
segmented semantically, and the labeled training data are collected as 

$$ \{(x_k, y_k)\}_{k=1}^{N} \tag{12.2} $$ 

on the dataset. In eqn (12.2), $x_k$ denotes the center patch of the target 
pixel, of size 128 x 128, and $y_k = (y_{k1}, y_{k2}, \ldots, y_{k4}) \in 
\mathbb{R}^4$ denotes the label of the target pixel: if $x_k$ belongs to the 
$i$th class, then $y_{ki} = 1$; otherwise, $y_{ki} = 0$. The main objective of 
the initial evaluation of the CNN is expressed as a function $f^*$; each patch 
$x_k$ is taken to have four labels, and the function is obtained from the 
training data as 


$$ f^* = \arg\min_{f} \sum_{k=1}^{N} \lVert f(x_k) - y_k \rVert \tag{12.3} $$ 


Based on the CNN architecture, the function $f$ is composed as 

$$ f = f^{(J)} \circ f^{(J-1)} \circ \cdots \circ f^{(1)} \tag{12.4} $$ 

where $f^{(j)}$ denotes the $j$th of the $J$ layers of the CNN. Each $f^{(j)}$ 
may be any of the layer types, for example, convolutional, pooling, or fully 
connected. With the weights and biases used in each $f^{(j)}$, the function 
$f$ is determined by the parameters $\theta$ as 

$$ \theta = (W_1, b_1, \ldots). \tag{12.5} $$ 

Eqn (12.4) can be expressed as eqn (12.6), using eqn (12.5) as a starting point. 

$$ \theta^* = \arg\min_{\theta} \frac{1}{N} \sum_{k=1}^{N} L(x_k, y_k; \theta) \tag{12.6} $$ 


Eqn (12.6) shows that, rather than determining $f^*$ directly, the set of 
parameters $\theta^*$ can be determined. Here, $L$ denotes the cross-entropy 
loss function, which measures the error between the label $y_k$ and the output 
$p_i(x_k; \theta)$ obtained for each label (class): 

$$ L(x_k, y_k; \theta) = -\sum_{i=1}^{4} y_{ki} \log p_i(x_k; \theta). \tag{12.7} $$ 


It returns the least-squares error for all training patches, as referred to in [64]. 
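
The cross-entropy loss of eqn (12.7) and the averaged objective of eqn (12.6) can be written out directly. This is a minimal sketch with made-up labels and probabilities; the function names and the small constant `eps` used to avoid taking the logarithm of zero are assumptions, not part of the chapter.

```python
import math
from typing import List, Sequence


def cross_entropy(y: Sequence[float], p: Sequence[float], eps: float = 1e-12) -> float:
    """Eqn (12.7): L = -sum_i y_i * log(p_i) for one sample.

    `eps` guards against log(0); it is an implementation detail assumed here.
    """
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, p))


def mean_loss(labels: List[Sequence[float]], probs: List[Sequence[float]]) -> float:
    """The quantity minimised in eqn (12.6): the average loss over N samples."""
    return sum(cross_entropy(y, p) for y, p in zip(labels, probs)) / len(labels)


# Made-up one-hot labels and predicted class probabilities for two samples.
labels = [[1, 0, 0, 0], [0, 0, 1, 0]]
probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
print(mean_loss(labels, probs))
```

Because the labels are one-hot, only the probability assigned to the true class contributes to each sample's loss, so a confident correct prediction gives a loss near zero.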


12.5.3 Performance Evaluation Metrics 


The proposed classifier was evaluated by computing the classification 
accuracy, sensitivity (SE), and specificity (SP), which are defined as 
follows: 


a. Accuracy in % 

$$ \text{accuracy}(T) = \frac{\sum_{i=1}^{|T|} \text{assess}(t_i)}{|T|} \times 100 \tag{12.8} $$ 

where 

$$ \text{assess}(t) = \begin{cases} 1, & \text{if } \text{classify}(t) = t.c \\ 0, & \text{otherwise.} \end{cases} $$ 

Here, $T$ is the set of instances to be classified, $t \in T$, $t.c$ is the 
class of instance $t$, and classify($t$) returns the algorithm’s 
classification of $t$. 

b. Sensitivity (SE) in % 

$$ SE = \frac{TP}{TP + FN} \times 100. \tag{12.9} $$ 

c. Specificity (SP) in % 

$$ SP = \frac{TN}{TN + FP} \times 100. \tag{12.10} $$ 


where TP, TN, FP, and FN signify the following: 


True positives: cancer cases correctly classified as cancer; 

True negatives: non-cancer cases correctly classified as not cancer; 

False positives: non-cancer cases incorrectly classified as cancer; 

False negatives: cancer cases incorrectly classified as not cancer. 
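
The three metrics defined above can be computed from the four counts as follows. This is a minimal sketch; the function names and the counts are made up for illustration and are not the chapter's results. Accuracy is written here as (TP + TN) divided by the total number of cases, which is the same quantity as the share of instances whose predicted class matches the true class.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Percentage of all cases that are classified correctly."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Percentage of cancer cases that are detected: TP / (TP + FN)."""
    return 100.0 * tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Percentage of non-cancer cases that are cleared: TN / (TN + FP)."""
    return 100.0 * tn / (tn + fp)


# Made-up counts for illustration only (not taken from the chapter).
tp, tn, fp, fn = 90, 80, 20, 10
print(accuracy(tp, tn, fp, fn))   # 85.0
print(sensitivity(tp, fn))        # 90.0
print(specificity(tn, fp))        # 80.0
```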


12.6 Results and Discussion 


The proposed technique was implemented in the R programming language, and the 
results were evaluated to determine its efficacy. The R program was run on a 
system with a 7th-generation Intel Core i7 CPU running at 3.40 GHz and 64 GB 
of RAM. To examine its effectiveness, the dataset obtained during cancer 
screening was arranged in matrix form. During the evaluation, the system used 
the screening input data together with other factors. To avoid false alarms, a 
threshold level is assigned to each class and module. 


Table 12.1 Classification matrix for the proposed model 

Real/predicted | Malignant | Benign | Total 
Exist          | 198       | 7      | 205 
Non-exist      | 13        | 351    | 364 
Total          | 205       | 364    | 569 


Table 12.2 Proposed method evaluation 

Dataset | Accuracy | Specificity | Sensitivity | PPV   | NPV   | F-score | AUC 
WDBC    | 98.4%    | 98.7%       | 98.9%       | 97.6% | 97.5% | 98.3%   | 0.992 


12.6.1 Performance Evaluation of the Proposed Model 


Table 12.1 shows the classification matrix for the proposed model employed 
in this study to classify the dataset. 

Table 12.2 summarizes the proposed model’s accuracy, specificity, 
sensitivity, positive and negative predictive values, F-score, and AUC. 
The proposed model has an accuracy of 98.4%, specificity of 98.7%, 
sensitivity of 98.9%, positive predictive value (PPV) of 97.6%, negative 
predictive value (NPV) of 97.5%, F-score of 98.3%, and AUC of 0.992. 
Figure 12.3 shows the performance evaluation of the proposed classifier. 


12.6.2 Performance Comparison of the Proposed Model 


The proposed model was compared with existing methods that were evaluated on 
the same dataset. Table 12.3 shows the results of the comparison. The existing 
models used techniques such as the dense convolutional network (DCN), 
multiple-weight SVM (MWSVM), genetic grey based neural network (G2NN), and 
fuzzy genetic grey-wolf based CNN model (FG2CNN) to classify the cancer 
dataset. 

The comparison focused on recent works that used the same dataset, especially 
those that applied DL methods to cancer classification. The best-performing 
model used feature selection, which the proposed model did not apply to the 
dataset; feature selection helped that model classify the dataset better than 
the DL techniques that did not use it. The best-performing model has an 
accuracy of 99.7%, specificity of 98.4%, sensitivity of 98.7%, and F-score of 
98.5%. The proposed model came second, with an accuracy of 98.4%, specificity 
of 98.7%, sensitivity of 98.9%, and F-score of 98.3%. The lowest-performing 
classifier has an accuracy of 94.9%, specificity of 94.5%, sensitivity of 
96.9%, and F-score of 96.0%. The results show that the proposed model 
performed well and compared favourably with the other DL models across the 
various metrics. 


[Bar chart "Performance Evaluation": accuracy, specificity, sensitivity, PPV, NPV, F-score, AUC] 

Figure 12.3 Performance evaluation of the proposed model. 


Table 12.3 Performance comparison of the model with other existing methods using the 
same dataset 

Models             | Accuracy | Specificity | Sensitivity | F-score 
MWSVM [65]         | 98.5     | 96.2        | 97.9        | 98.8 
G2NN [65]          | 98.9     | 97.2        | 97.8        | 97.4 
Fuzzy + G2CNN [65] | 99.7     | 98.4        | 98.7        | 98.5 
DCN [66]           | 94.9     | 94.5        | 96.9        | 96.0 
KNN [67]           | 96.5     | 95.7        | 95.9        | 96.2 
Proposed model     | 98.9     | 98.7        | 98.4        | 97.9 


12.7 Conclusion and Future Directions 


Breast cancer is one of the most dangerous illnesses among women in the world 
today. Researchers have used ML models to classify breast cancer, but their 
accuracy and efficiency remain questionable. 
Hence, DL techniques have been proposed to achieve high accuracy and 
efficiency, especially when large amounts of data are involved. DL models have 
become popular in various fields for classifying large datasets, especially in 
healthcare systems, and CNN classifiers have been used to classify various 
illnesses accurately and efficiently. Therefore, this chapter proposed an 
intelligent IoT-enabled CNN for the classification of breast cancer with 
better accuracy and efficiency, which will help medical doctors to detect and 
treat cancer patients remotely and in real time. The evaluation of the 
proposed system using various metrics shows that the CNN performs well 
compared with existing methods, with an accuracy of 98.4%, specificity of 
98.7%, sensitivity of 98.9%, and F-score of 98.3%. Future work will consider 
the use of feature selection to select relevant features by removing unwanted 
and unproductive features from the dataset. The security and privacy of 
IoT-based systems will also be taken into consideration so that users can 
have full trust in them. 